Anecdotal reports are generally regarded as being of poor evidential quality. However, we have previously proposed that some anecdotal reports of adverse drug reactions are underappreciated sources of definitive associations.[1] In this article, we extend our discussion to a detailed consideration of the relevance to pharmacovigilance of such reactions, which we propose can serve as pure gold standards, or at least very high-grade ore.

We first reiterate briefly our previous observation, namely that there are at least four categories of adverse events that can provide definitive evidence of true drug-event associations, without the need for formal validation. The four groups, examples of which are listed in table I, are as follows:

  1. Extracellular or intracellular tissue deposition of the drug or a metabolite.

  2. A specific anatomical location or pattern of injury.

  3. Physiological dysfunction or direct tissue damage demonstrable by physicochemical testing.

  4. Infection, as a result of the administration of an infective agent as the therapeutic substance or because of demonstrable contamination.

Table I. Definitive anecdotal adverse drug reactions

We have colloquially called these reactions ‘between-the-eyes’ reactions, because the diagnosis is obvious, or almost so, in the individual patient. In some cases, the diagnostic value of the event can be enhanced by further investigation, but the conclusion will always be related to the individual affected.

This gives a new perspective on anecdotal reports of adverse drug reactions, predicated on concepts that are not included in traditional causality assessment procedures, which tend to emphasise chronological characteristics of reported associations (i.e. the time courses of challenge, dechallenge and rechallenge).[45,46] We intend no criticism of these traditional criteria, which can provide good evidence for causation, for example in well documented cases of immediate (e.g. ‘end-of-the-needle’) events, in which the temporal relation is so striking in otherwise stable circumstances that alternative causes can be excluded.

We begin with a general discussion of gold standards in pharmacovigilance and other biomedical settings. We then discuss the need for gold standards in pharmacovigilance and finally the availability of such gold standards.

1. Thesis: the Demand for Gold Standards in Biomedical Science

There are many publications on gold standards in the context of medical screening, diagnostic tests and epidemiological studies. The use of an imperfect gold standard can introduce bias into the calculation of standard performance parameters (for example, sensitivity, specificity, predictive value and receiver operating characteristic curves).[47–49] The significance and direction of such bias depend on multiple factors, including whether the new test and the reference standard are ‘conditionally independent’, the quality of the imperfect gold standard (that is, how well the imperfect gold standard correlates with the outcome of interest) and the specific metric being considered.[47]
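The effect of an imperfect reference can be illustrated with a minimal sketch. It assumes that the new test and the reference standard are conditionally independent given true status; the function name and all input values are hypothetical and are not taken from the cited studies.

```python
def apparent_performance(prevalence, se_test, sp_test, se_ref, sp_ref):
    """Sensitivity and specificity of a new test as they would be measured
    against an imperfect reference standard.

    Assumption (ours, for illustration): test and reference errors are
    conditionally independent given the true status.
    """
    p = prevalence
    # Joint probability that test and reference are both positive
    both_pos = p * se_test * se_ref + (1 - p) * (1 - sp_test) * (1 - sp_ref)
    # Probability that the reference is positive
    ref_pos = p * se_ref + (1 - p) * (1 - sp_ref)
    # Joint probability that test and reference are both negative
    both_neg = p * (1 - se_test) * (1 - se_ref) + (1 - p) * sp_test * sp_ref
    # Probability that the reference is negative
    ref_neg = p * (1 - se_ref) + (1 - p) * sp_ref
    return both_pos / ref_pos, both_neg / ref_neg


# Hypothetical inputs: true sensitivity and specificity of the new test are
# both 0.90; the reference has sensitivity 0.80 and specificity 0.95.
app_se, app_sp = apparent_performance(0.10, 0.90, 0.90, 0.80, 0.95)
print(f"apparent sensitivity = {app_se:.3f}, apparent specificity = {app_sp:.3f}")
```

With these assumed inputs, both apparent values fall below the true value of 0.90, and the sensitivity falls markedly (to about 0.61). With a perfect reference the function returns the true values. Different inputs, or a violation of conditional independence, can change the size and direction of the bias.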

Gold standards can only provide useful performance estimates if they are highly correlated with an outcome of interest and if the research question is framed in appropriate common-sense terms.[48–50] Clinical scenarios often lack ‘perfect’ or ‘pure’ gold standards, because the target environment or outcome of interest is imprecise.[48,50,51] Even assessment of tissue pathology is associated with inter-observer variability.[52]

Other approaches can be used to improve estimates derived from proxy gold standards in screening, diagnosis and epidemiological studies, including discrepant analysis, latent class models and Bayesian methods.[53] There is a growing body of epidemiological research on making causal inferences from observational data when there are multiple sources of bias and error. These methods have yet to be specifically applied to pharmacovigilance, but they are worthy of exploration.[54,55]

2. Antithesis: the Need for Gold Standards in Pharmacovigilance

Pharmacovigilance has a dual nature: the search for ‘signals’ of adverse events that are novel in terms of their nature, severity and/or frequency is accomplished by evaluation of single reports and case series,[56,57] as well as numerical approaches akin to ‘anomaly detection’ in other forms of public health surveillance.[58] Thus, signal detection and evaluation is a multi-step process that often uses multiple tests and data streams.

Spontaneous reporting systems, composed largely of anecdotal case reports that are not peer reviewed, are currently the cornerstone of postmarketing signal detection. Such databases may be very large, sparse and plagued by data distortion and corruption at the level of the individual report. Furthermore, the overall sampling mechanism reflects a convenience sample, based on differential reporting across drugs and events. In addition, the data lack information on the numbers of patients exposed or at risk.

Causal assessments in pharmacovigilance can occur at the level of both the individual case report and the overall association (i.e. the case series). These are not independent processes, as the amount of information contained in individual cases determines the number of cases required to achieve a critical evidentiary mass for the purposes of decision making; on average, the more definitive the evidence per case (either for or against causality), the lower is the critical evidentiary mass of cases. It may be possible to establish a causal drug-event association with sufficient certainty from only a few cases if certain causality features are well documented and implicate the drug with high probability in each case. If the report contains less information and the probability of causation is lower per report, coincidental associations are more likely and more cases will be required to achieve the critical evidentiary mass. This underlines the need for comprehensive reporting of individual cases.[59]
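The inverse relation between evidence per case and the number of cases required can be made concrete with a toy Bayesian sketch. This is an illustration under strong assumptions of our own, not a method proposed in the literature cited here: each report is treated as independent evidence carrying the same likelihood ratio, and the prior probability, target probability and likelihood ratios are hypothetical.

```python
def cases_needed(prior_prob, lr_per_case, target_prob):
    """Smallest number of reports needed for the posterior probability of a
    causal association to reach target_prob.

    Assumptions (illustrative only): reports are independent and each one
    multiplies the odds of causation by the same likelihood ratio (> 1).
    """
    if lr_per_case <= 1:
        raise ValueError("likelihood ratio per case must exceed 1")
    odds = prior_prob / (1 - prior_prob)
    target_odds = target_prob / (1 - target_prob)
    n = 0
    # Each additional report multiplies the odds by the likelihood ratio
    while odds < target_odds:
        odds *= lr_per_case
        n += 1
    return n


# Well documented, highly informative reports versus sparse reports
print(cases_needed(prior_prob=0.01, lr_per_case=50, target_prob=0.95))
print(cases_needed(prior_prob=0.01, lr_per_case=3, target_prob=0.95))
```

Under these assumed values, two highly informative reports reach the threshold, whereas seven weakly informative reports are needed. Real reports are neither independent nor equally informative, so the numbers have no significance beyond showing the direction of the relation.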

With increasingly large databases and massive influxes of reports of adverse events, statistical data-mining algorithms (DMAs) are increasingly promoted to assist the process of signal detection in pharmacovigilance. Most contemporary DMAs are variations of ‘disproportionality analysis’ and can be broadly classified into frequentist types (proportional reporting ratios and reporting odds ratios) and Bayesian types (Bayesian Confidence Propagation Neural Network, and the multi-item gamma Poisson shrinker).[60] There is considerable controversy about the value and optimal use of these methods.[61] Each has case-dependent strengths and weaknesses,[62] and all have significant limitations.[60]
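For readers unfamiliar with the frequentist measures, the following sketch computes a proportional reporting ratio (PRR) and a reporting odds ratio (ROR) from a 2 × 2 table of report counts, using the standard textbook definitions. The counts are invented for illustration and the function name is hypothetical; the Bayesian methods mentioned above involve shrinkage and are not reproduced here.

```python
import math


def disproportionality(a, b, c, d):
    """PRR and ROR (with an approximate 95% CI for the ROR) from counts:

    a: reports with the drug and the event
    b: reports with the drug and any other event
    c: reports with any other drug and the event
    d: reports with any other drug and any other event
    All four counts are assumed to be non-zero.
    """
    prr = (a / (a + b)) / (c / (c + d))
    ror = (a * d) / (b * c)
    # Standard error of log(ROR) by the usual large-sample approximation
    se_log_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - 1.96 * se_log_ror)
    upper = math.exp(math.log(ror) + 1.96 * se_log_ror)
    return prr, ror, (lower, upper)


# Invented counts for illustration only
prr, ror, (lower, upper) = disproportionality(a=20, b=980, c=100, d=98900)
print(f"PRR = {prr:.1f}, ROR = {ror:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

Both measures compare the reporting frequency of the event for the drug of interest with that for all other drugs in the database. Because the counts are reports rather than exposed patients, a raised value reflects disproportionate reporting, not an estimate of risk.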

Pharmacovigilance DMAs are typically studied by retrospective application to authentic spontaneous reporting systems (although some advocate database simulations as the definitive approach to validation).[60,63] Attempts to test, validate and interpret results from these tools are complicated by various factors, both non-methodological, such as potential commercial and intellectual conflicts of interest, and methodological.[64] The latter include confirmation bias, the abundance (or perhaps overabundance) of available data-mining options and configurations, the ad hoc and subjective nature of threshold selection, the need to assess both ‘dynamic’ (i.e. ‘time-to-signal’) and traditional ‘static’ performance metrics, such as sensitivity, specificity and predictive value, and a claimed lack of gold standards for adjudicating causality in spontaneous reports.[60,61] The last problem pertains to decisions about which spontaneously reported adverse events should be included in reference sets for testing DMAs. In real time, the detection of an event precedes its verification. In the current context, event verification is a prerequisite for assessing the procedures and tools that are used in event detection. Reference sets must be constructed from spontaneously reported associations that were unknown at the time of marketing, which can be classified as causal or non-causal with a reasonable degree of confidence (true positive and true negative, respectively), and for which there is a reasonably specific terminology in the adverse effects dictionary used to encode the data.
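The ‘static’ metrics mentioned above, and their dependence on the chosen threshold, can be sketched as follows. The reference set, the disproportionality scores and the thresholds are all hypothetical; the sketch assumes only that each reference association has been classified as causal or non-causal and that an association is flagged when its score reaches the threshold.

```python
def static_metrics(scores, labels, threshold):
    """Sensitivity, specificity and positive predictive value of a
    signalling rule (score >= threshold) against a labelled reference set.

    scores: disproportionality scores, one per drug-event association
    labels: True for reference true positives, False for true negatives
    """
    tp = fp = tn = fn = 0
    for score, is_causal in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_causal:
            tp += 1
        elif flagged and not is_causal:
            fp += 1
        elif not flagged and is_causal:
            fn += 1
        else:
            tn += 1
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "ppv": tp / (tp + fp) if tp + fp else nan,
    }


# Hypothetical reference set: three causal and three non-causal associations
scores = [3.2, 1.1, 5.0, 0.8, 2.5, 1.9]
labels = [True, True, True, False, False, False]
for threshold in (2.0, 1.0):
    print(threshold, static_metrics(scores, labels, threshold))
```

Lowering the threshold from 2.0 to 1.0 in this toy example raises sensitivity from two-thirds to 1 and lowers specificity from two-thirds to one-third, which illustrates why ad hoc threshold selection complicates comparisons between methods. ‘Time-to-signal’ metrics would additionally require the date at which each association first crossed the threshold, which this sketch does not model.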

Causal classification requires a reasonably reliable gold standard for designating a drug-event association as a case or non-case. Here we refer to ‘case’ versus ‘non-case’ at the level of the reported drug-event association, rather than the level of individual reports. A purported association may be an ensemble consisting of definitive cases, probable or possible cases, and non-cases. That is to say, the total corpus of reports of a particular drug-event combination may include distinct subpopulations of cases, some demonstrably causal, some causal but not conclusively demonstrated as such, and some not causal at all.

3. Synthesis: the Availability of Gold Standards in Pharmacovigilance

Given the data limitations delineated in the previous section, it is not surprising that the claimed lack of gold standards for assessing causality is among the most frequently cited of all obstacles in testing DMAs. Its most extreme expression is the objection to the use of associations for which there is ‘no guarantee’ of causality. This viewpoint may be of philosophical interest, but it should not be selectively applied to shield emerging methods from scientific scrutiny or to object to findings because they are disagreeable. There are alternative ways to think about case definitions in terms of outcomes of interest that are logically sound and may have greater practical utility in real-world pharmacovigilance scenarios.[65] This includes the value of effective surveillance systems in promoting a greater awareness and understanding of complex, dynamic and uncertain environments,[66] which in the case of pharmacovigilance includes warning signs of probably causal but unproven associations with potential public health implications.

What constitutes a gold standard for designating cases and non-cases varies according to the circumstances in which it is applied and the objectives of the surveillance system. The range of observed event frequencies in patients taking a drug relative to the corresponding background incidence[56] indicates that establishing definite or probable associations (i.e. suitable reference events) across the spectrum of pharmacovigilance scenarios requires analysis of the full range of datasets from case reports, observational studies and randomised clinical trials[67] and must exploit both clinical observations and numerical data; this has been termed ‘teleoanalysis’.[59,68] Nothing is ever irrefutably proven, and even the most pure gold standard is established with a degree of probability, albeit a high one. Therefore, we believe that much of the discussion about absolute or perfect gold standards is too restrictive and demonstrates an under-appreciation of both the uncertainty of the target environment and the range of events of legitimate interest to pharmacovigilance professionals.[65]

Thus, immediate (e.g. ‘end-of-the-needle’) events, which are replicated with well documented positive rechallenges (i.e. close temporal proximity between each drug exposure and an objective event that is not a manifestation or complication of the treatment indication), corroborated by robust epidemiological methods, independently detected in large postmarketing clinical trials and/or supported by cogent clinical pharmacological data, are reasonable proxy gold standards and can provide useful information on the ability of emerging quantitative methods to help identify actionable associations (although each may not reflect the full spectrum of phenomena encountered in pharmacovigilance).[69–72]

We believe that there are spontaneously reported events, largely ignored in this context, that may constitute a valuable component of reference sets used to measure data-mining performance, in that they are definitive adverse reactions that can be regarded as being pure gold or of extremely high-grade ore.

4. Definitive Anecdotal Adverse Drug Reactions

As sources of pure gold, we have specifically suggested four groups of spontaneously reported adverse events, for which causal or contributory attribution to the drug is either irrefutable or demonstrated with a high level of confidence.[1] The four groups are listed, with examples, in table I.[2–44] We propose that the principles illustrated define a set of gold standards that can be used to supplement those requiring clinical judgment and those defined with reproducible results in independent samples.

It should be borne in mind that the four descriptive categories and corresponding examples are not mutually exclusive or exhaustive. For example, injection-site reactions to aluminium-containing vaccines[16] have characteristics that are consistent with more than one category, since there is anatomical contiguity (category 2) and electron microprobe analysis has shown features of aluminium crystal-storing histiocytosis (category 1). Fixed drug eruptions add an element of physical specificity that can further strengthen associations demonstrated with traditional provocative rechallenge.

In all cases, additional evidence (for example, a convincing time course) may be adduced to boost the strength or ‘caratage’ of an association. For many of the tests, scenarios, and procedures described in the following sections, such as photopatch testing and physiological testing of sweat and salivary gland function, positive rechallenge can also contribute, in which case blinding and the use of a placebo can further increase the evidentiary value of a single case. For example, positive rechallenge in an individual case is enhanced when the patient is unaware of the re-administration[73] or when placebo-controlled rechallenge is used.[74]

4.1 Extracellular or Intracellular Tissue Deposition of the Drug or Metabolite

The first category includes adverse events in which the injury is due to either extracellular or intracellular deposition of a drug or metabolite (i.e. the pathological lesion is composed of the drug or metabolite) as demonstrated by objective physico-chemical testing. The feasibility of such testing in these cases usually implies that the lesion itself is either extrinsic to body tissues or involves tissues or body fluids that are accessible for biopsy or some form of in situ examination. These events can be considered pure gold in the counterfactual sense,[75] in that the events could not have occurred in the absence of the drug.

It is important to recognise the distinction between a lesion caused by the compound and one in which the compound is an innocent bystander. By way of example, renal stones associated with efavirenz have been described as containing 50% metabolites of efavirenz and 50% unspecified proteins;[76] this may not be enough to exclude an innocent bystander effect, since the primary phenomenon may have been protein concretion. In well documented cases in which the calculus is composed entirely or predominantly of drug or metabolite, as has been reported for example with triamterene,[5,6] one can exclude pure confounding and infer either that an adverse drug reaction has occurred or at least that the drug played a major contributory role by way of a drug-disease interaction.

4.2 A Specific Anatomical Location or Pattern of Injury

The second category includes adverse events in which the anatomical location and/or pattern of injury is sufficiently specific to attribute the effect to the drug without the need for implicit judgment or formal investigation. The mechanism of injury can be related to physicochemical or pharmacological properties of the drug.

It is tempting to include certain intramuscular injection-site reactions in this category, but caution is warranted. A distinctive example is Nicolau’s syndrome (embolia cutis medicamentosa), a rare, acute, necrotic, livedoid dermatitis reported with intramuscular injection of various drugs, including bismuth, modified-release formulations of penicillin, NSAIDs and glucocorticoids.[77–80] Experimental evidence supports the theory of inadvertent intra-arterial or paravascular injection, leading to structural or functional vascular occlusion, but the relative contributions of needle injury, volume effect and/or physicochemical characteristics of the drug (for example, fat emulsion or microcrystal deposition) have not been conclusively determined. Similarly, reports of injection-site reactions with intramuscular and subcutaneous injections of drugs that are contained in bottles or cartridges containing latex plungers and diaphragms may reflect latex allergy rather than a reaction to the drug.[80,82] Nevertheless, in these cases, the event could be classified as a definitive adverse reaction to the whole formulation rather than to the drug itself.

4.3 Physiological Dysfunction or Direct Tissue Damage Demonstrable by Physicochemical Testing

The third category includes adverse events that involve physiological dysfunction or tissue damage for which documentation by physicochemical testing is feasible. Drug-event associations in this category may not all be pure gold, in that some of the confirmatory test procedures may not be foolproof and may be operator or situation dependent. However, when properly performed and interpreted they can provide a level of confidence in attribution suitable for informed decision making in pharmacovigilance. An example is photopatch testing for photo-allergy. False-positive results are problematic in this setting: a positive result could reflect a local irritant effect of the drug interacting additively with a subclinical effect of the ultraviolet light source. Nevertheless, when drug versus vehicle and irradiated versus non-irradiated controls are properly used, these tests have been reported as being diagnostic in individual cases.

4.4 Infection, Either Due to the Administration of an Infective Agent as the Therapeutic Substance or to Demonstrable Contamination

The fourth category includes adverse drug reactions related to infections. These can be due either to the contaminating presence of the organism or because the product itself consists of live microbes. Confirming causality would involve proving identity between the infecting organism and the organism contained in the product and/or confirmation of matching batch-specific distribution of the contamination and the event.

5. Discussion

We have proposed a framework, with illustrative examples, of types of drug-event associations that can be considered to be definitive (‘between-the-eyes’ adverse effects) based on one or a small number of cases.[1] Preliminary data-mining analyses have already been performed for some of these associations.[6]

Given the unique character of pharmacovigilance, namely the variety of events under surveillance and the need for qualitative and quantitative probabilistic assessments at the level of single reports and causally heterogeneous case series, it is appropriate to consider how we can test the decision-making potential of pharmacovigilance systems and tools. A key question is the construction of reference sets of suspected adverse drug reactions classified with a reasonable degree of confidence as true-positive associations. Definitive adverse reactions could contribute to that.

The adverse drug reactions we have discussed here do not represent a ‘skeleton key’ that will unlock the gate to a full understanding of the performance of pharmacovigilance tools and systems. They represent one subset, and a small one at that, of the ‘sample space’ of adverse drug reactions. One could not validate pharmacovigilance systems by extrapolating from this subset of events alone. However, the range of test scenarios and datasets should mirror the richness and variety of real-world pharmacovigilance; these types of events could form a valuable part of that range.

A fair question, along similar lines, is what is the relevance of findings from quantitative pharmacovigilance tools to reported adverse drug reactions that are definitive or highly probable based on clinical specificity? Should the application and testing of quantitative tools be limited to adverse events that are distinctive only by virtue of quantitative representation? Our response is 3-fold. First, if, in the real world, each and every spontaneous report from the first submission were definitive, and thus immediately recognisable, quantitative methods would offer little incremental value for prospective surveillance of these types of events. However, pharmacovigilance experience shows that complete documentation is the exception rather than the rule, and that there may be a significant time lag until the first well documented report. Secondly, a potential function of DMAs is to provide a safety net against human cognitive gaps in signal detection procedures based on manual review. Finally, the use of such events to help establish ‘assay sensitivity’ is related to, but distinct from, the question of how these events could be detected prospectively. In other words, the ability of quantitative tools to highlight such events may provide an additional element of reassurance in their power to detect credible phenomena. Therefore, these types of events are relevant both from the perspective of testing and for prospective surveillance.

It is also important to evaluate ‘true negatives’ when testing classifier technology. This is challenging, since today’s ‘true-negative’ association could become tomorrow’s ‘true-positive’ association in the light of additional data. We offer the following possibilities, more as a starting point for discussion than as foolproof solutions.

Perhaps the most basic approach is to use reported drug-event associations that are generally regarded as non-causal after extensive experience with the drug. These could suffice for real-world pharmacovigilance purposes, despite objections based on theoretical or philosophical considerations of absolute certitude. There are other possibilities. For example, collections of spontaneous reports may contain helpful data: as of the fourth quarter of 2004, there were >1500 adverse event reports listing placebo as a suspect medication. Studies of adverse non-drug events and nocebo responses are notable for the inclusion of non-serious and/or subjective events, but some of the data reported with placebo in the US FDA Adverse Event Reporting System (AERS) database are medically serious and can be associated with statistical disproportionalities.[83] These reported associations might be considered as true negatives in drug-adverse event reference sets. Although some may object that these do not truly represent spontaneous reports, spontaneous reporting system databases are plagued by numerous forms of reporting artifacts, leading to reporting that is not truly spontaneous; this is just one example in which the artifact is readily identifiable as such, and for which conclusive causality assessment can be made. In addition, some associations that were previously classified as true positives have been discounted over time (for example, ‘phantom ships’),[84] and some of these may serve as true-negative reference associations. A good example is congenital anomalies reported anecdotally with Bendectin®/Debendox® (doxylamine/dicycloverine/pyridoxine), an association that was thoroughly discounted over time,[85] yet has been shown to be associated with a signal of disproportionality in one preliminary data-mining exercise.[86]

Other possible true-negative reference events would be those that challenge fundamental scientific principles, e.g. the reporting of phototoxicity with potassium chloride, in which the absence of a required action spectrum would be sufficient to refute the reported association.[87]

When evaluating signal-detection methods with the aforementioned reference associations and concepts, the usual prerequisites apply: initial detection after marketing authorisation, the existence of spontaneous reports and a reasonable dictionary coding scheme to represent the reported effects. In the absence of detailed case narratives, and when the confirmed case is a literature report that is not contained in the database used for analysis, it is assumed that one or more reports of the association in the database are clinically consistent with the confirmed case. Therefore, not every example will necessarily be applicable, but collectively they can be building blocks in reference sets used to test pharmacovigilance systems. The associated concepts can be used to identify additional events worthy of consideration. We also hope that this analysis will stimulate discussions of causality assessment of individual reports and optimising clinical cognition in signal detection and evaluation.

6. Conclusions

In this article, we have established the notion that isolated reports can be definitive in constructing gold standards in pharmacovigilance. We stress that we are not proposing a new classification of adverse drug reactions; we have merely described categories of reaction that can be regarded as definitive when described anecdotally. Our list of categories is probably not exhaustive. In addition, although we have identified adverse effects that would not need confirmation in formal studies, studies with independent datasets would still be necessary to quantify the risk.

What we have not established here are best practices for designing and interpreting validation and testing using such gold standards. Our analysis does not obviate the need for continuing discussions about broader issues in the testing of emerging pharmacovigilance technologies, such as the inherent limitations of all forms of disproportionality analysis, the relevance of tests in isolation (the analogue of so-called ‘test research’) versus the study of the incremental contribution of the new tools to existing methods (so-called ‘diagnostic research’)[88] and the optimal deployment of emerging technologies within comprehensive signal detection programmes based on multiple methods and data streams.

Perhaps more fundamentally related to the issue of gold standards is the nature of a ‘signal’ itself. Some formulations of signal detection performance are based on the availability of associations that are provable without any residual uncertainty. We think this may be a limited and unproductive approach that misses the value of ‘situational awareness’ in real-world pharmacovigilance.[66]

Spontaneous anecdotal reports have limitations, but we believe that they can also be rich sources of gold nuggets, which can be mined for possible inclusion in reference sets for testing any method used to support the signal detection process. This is in the spirit of others who have discussed the tendency to undervalue observations from case reports.[89] Because the types of events we describe are diverse but relatively uncommon, pure gold and high-grade ores do not substitute for reference events traditionally used to test signal-detection tools. By themselves they would constitute a quantitatively inadequate and biased reference set. Rather, we believe that they can supplement and enrich the usual approaches.