Key Points

Sequential Probability Ratio Tests (SPRTs) may have a role in detecting signals from spontaneous reports of suspected adverse drug reactions. They have somewhat different properties to other commonly used statistical methods for that purpose.

Using a combination of variable hypothesised relative risks (hRRs) allows for the detection of different types of adverse events (AEs). For drug–AE pairs that are rare with low expected counts, we need to apply a higher hRR for the SPRT method to pick up signals of disproportionate reporting of concern. On the other hand, a lower hRR will be useful for the more common drug–AE pairs.

1 Introduction

Post-marketing surveillance of drugs and vaccines is important to minimise risks with marketed drugs. In practice, spontaneous suspected adverse drug reaction (ADR) reporting remains the main source of information for regulators in this monitoring [1]. Pharmacovigilance distinguishes true ADRs from ‘adverse events’ (AEs) that are not caused by the drug in question. A signal is a potential safety concern that a drug may be associated with a previously unrecognised hazard requiring further investigation [2].

Since 1993, the Vigilance and Compliance Branch of the Singapore Health Sciences Authority (HSA) has received spontaneous local ADR reports from healthcare professionals (e.g. 83.2% from clinicians, 12.2% from pharmacists, 2.7% from other healthcare professionals and 1.9% from pharmaceutical companies) via facsimile, mail or online. Its Spontaneous Reports System (SRS) database has, in recent years, also received reports from the public healthcare institutions in real time via the Critical Medical Information Store (CMIS), and as a result, there has been a 40-fold increase in the number of ADR reports received regularly [3]. The number of reports received annually is now about 20,000.

Statistical ‘data mining’ methods emerged in the late 1990s to complement the traditional manual review, and these are commonly called ‘disproportionality analysis’ [4, 5]. Some examples of frequentist statistical methods are the proportional reporting ratio (PRR) and reporting odds ratio (ROR) [6,7,8]. Examples of Bayesian methods are the Bayesian Confidence Propagation Neural Network (BCPNN) [7, 9], Gamma Poisson Shrinker (GPS) [10] and Multi-item Gamma Poisson Shrinker (MGPS). The MGPS is now used by the US Food and Drug Administration (FDA) and the UK Medicines and Healthcare products Regulatory Agency (MHRA) [11, 12]. All these methods are based around the ratio of observed-to-expected counts of reports to obtain signals, and many studies have shown that no single signal detection algorithm (SDA) provides uniformly better performance [13]. The commonly used methods do not allow for multiple looks at the accumulating data over time, which can result in large numbers of false positive findings [12].
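As a minimal sketch of the frequentist measures named above, the following Python snippet computes the ROR and PRR from a 2 × 2 table of report counts, together with lower 95% confidence bounds based on the usual log-scale standard errors. The function name, the cell labels `a` to `d` and the example counts are illustrative assumptions, not values from this study, and the signalling criteria actually applied in this paper are those given in the Appendix.

```python
import math

def ror_prr(a, b, c, d, z=1.96):
    """Return ROR and PRR with lower confidence bounds for a 2x2 table.

    a: reports with the drug and the AE of interest
    b: reports with the drug and any other AE
    c: reports with other drugs and the AE of interest
    d: reports with other drugs and any other AE
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this sketch")
    ror = (a * d) / (b * c)
    se_log_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    prr = (a / (a + b)) / (c / (c + d))
    se_log_prr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return {
        "ror": ror,
        "ror_lower": math.exp(math.log(ror) - z * se_log_ror),
        "prr": prr,
        "prr_lower": math.exp(math.log(prr) - z * se_log_prr),
    }

# Hypothetical counts, for illustration only.
result = ror_prr(a=10, b=90, c=100, d=9800)
print(result)
```

A commonly used rule flags a pair when the lower bound exceeds 1, but that threshold is a convention assumed here rather than something stated in this section.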

One method, the Sequential Probability Ratio Test (SPRT), is less affected by multiple testing over time because it is specifically designed to allow for repeated looks at accumulating data.

SPRT was developed by Wald in the 1940s [14, 15] and has mainly been used in process monitoring. A review of the literature showed that the SPRT may offer advantages over the other methods to overcome multiple testing problems [16,17,18]. It has been used in the context of scanning electronic health records, but has not had extensive evaluation for spontaneous reports.

Specifically, SPRT compares two hypotheses based on the likelihood of observing the data given those hypotheses [17, 19]. Unlike the other methods, SPRT is based on the difference between (rather than the ratio of) the observed and expected values. However, there has been limited research evaluating the performance of SPRT in an SRS database similar to Singapore’s, and a previous evaluation of SPRT used a single alternative hypothesis on simulated data [20]. On theoretical grounds, at least, methods that do not allow for multiple testing over time will have a higher rate of false positive findings, that is, false signals.
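To make the comparison of two hypotheses concrete, the sketch below assumes a Poisson model in which the null hypothesis sets the mean count to the expected value and the alternative sets it to hRR times the expected value. The resulting log-likelihood ratio depends on the observed count and on a term in the expected count that is subtracted from it. The error rates `alpha = 0.05` and `beta = 0.20`, the Wald upper boundary, the function names and the quarterly counts are all illustrative assumptions; the specification used in this paper is the one in the Appendix. Only the upper boundary is checked, since the aim here is to raise signals.

```python
import math

def poisson_llr(observed, expected, hrr):
    """Log-likelihood ratio of H1 (mean = hrr * expected) vs H0 (mean = expected)."""
    return observed * math.log(hrr) - expected * (hrr - 1.0)

def first_signal(cum_observed, cum_expected, hrr, alpha=0.05, beta=0.20):
    """Return the index of the first look at which the LLR reaches the
    assumed upper Wald boundary, or None if it never does."""
    upper = math.log((1.0 - beta) / alpha)
    for look, (obs, exp) in enumerate(zip(cum_observed, cum_expected)):
        if poisson_llr(obs, exp, hrr) >= upper:
            return look
    return None

# Hypothetical cumulative counts for one drug-AE pair over four quarters.
observed = [1, 2, 3, 5]
expected = [0.004, 0.008, 0.012, 0.016]

look_hrr_high = first_signal(observed, expected, hrr=4.1)
look_hrr_low = first_signal(observed, expected, hrr=2.0)
print(look_hrr_high, look_hrr_low)
```

With these made-up counts the higher hRR reaches the boundary at the second look and the lower hRR only at the fourth, which is the kind of behaviour the repeated looks are meant to accommodate.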

This paper explores how SPRT behaves and reviews its utility and applicability to pick up signals of disproportionate reporting (SDRs) for potential drug safety signals in the Singapore context. We also compare the performance of SPRT with three other SDAs, namely ROR, BCPNN and GPS.

2 Methods

The spontaneous reporting database in Singapore has been described by Ang et al. [21]. In the database, each valid report has at least one product and at least one suspected ADR term included. Products are coded using standardised drug names, and adverse reaction terms are coded using the World Health Organization (WHO) Adverse Reaction Terminology (WHO-ART) (version 151) [22].

The SPRT method requires that specific hypotheses regarding a relative risk to be detected are set out, and in the context of signals of ADRs, some arise from relatively frequently occurring AEs where small relative risks are nevertheless potentially important, while others are from rare events where only higher relative risks can be detected. The details of the SPRT method are described in the “Appendix”, together with a brief description of the other methods used here, including the criteria that determine whether the counts for particular drug–AE pairs constitute a signal.

We analysed signals using all data from 1993 to 2013, and in additional analyses, also reviewed the data as they were up to 2011 and examined the new signals that arose in 2012 and 2013, mirroring what is done in practice with accumulating data. We also classified the drug–AE pairs in terms of seriousness of AE and whether the AE was labelled for that drug or not.

We evaluated the methods, as most other comparisons have done, using sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV), considering factors such as whether the pair is a significant SDR and whether the drug–AE pair is labelled. The drug–AE pairs were reviewed by a senior pharmacist and considered as labelled if they were mentioned in Micromedex® [23] or in current regulatory agency-approved drug labels in Singapore. A pair counted as labelled whether the match was exact (word for word) or synonymous (same meaning). The AE terms were considered serious if they appear in the WHO critical terms list, were considered medically significant suspected serious ADRs by the US FDA, or appear in the Important Medical Event Terms (IME) list developed by the EudraVigilance Expert Working Group [22, 24, 25]. All analyses were performed using R software, version 3.3.1 [26], including the signal detection package PhViD [27].
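The four performance measures can be sketched as follows, under the assumption that labelling status serves as the reference standard: a signalled and labelled pair counts as a true positive, a signalled but unlabelled pair as a false positive, and so on. The function name and the counts are hypothetical and are not taken from Table 3.

```python
def screening_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 classification table.

    tp: signalled and labelled      fp: signalled, not labelled
    fn: not signalled but labelled  tn: neither signalled nor labelled
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical counts, for illustration only.
metrics = screening_metrics(tp=40, fp=10, fn=20, tn=130)
print(metrics)
```

Because labelling is an imperfect proxy for a true ADR, a "false positive" under this definition may still be a signal worth investigating.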

3 Results

The HSA received a total of 151,180 AE reports from 1993 to 2013; these reports involved 23,183 unique drug–AE pairs. There were 1569 different suspected drug substances and 1014 different AE terms. If every drug had reports for every possible AE, then all possible combinations of drug (1569 drugs) with AE term (1014 terms) would mean that there were potentially 1,590,966 possible pairs (1569 × 1014). Of those possible combinations, only 23,183 (1.5%) unique pairs actually occurred. The counts in each of these 23,183 cells had each of the methods applied to them.

The effect of the two hypothesised relative risks (hRRs) used for detecting signals using SPRT with hRR = 2 or hRR = 4.1 is shown in Table 1, where we show the distribution of observed and expected counts where signals are detected, giving medians and 5th and 95th centiles. We analysed the SDRs in terms of seriousness of AE and whether the AE is labelled or not; for the period from 1993 to 2013, a total of 137 unique serious and non-labelled drug–AE pairs were signalled by SPRT. Of the 137 drug–AE pairs, 88 drug–AE pairs were not picked up by hRR = 2, as the number of observed counts was less than five.

Table 1 Comparing SPRT with hRR = 2 and hRR = 4.1 in terms of seriousness of AE, whether the AE is labelled or not labelled, observed and expected values

The ROR and SPRT methods detected more SDRs than BCPNN and GPS. Figure 1 is a Venn diagram showing the overlap of significant pairs detected by each method for the complete data from 1993 to 2013. The SDRs detected by the BCPNN method are a subset of those detected by the other methods. The large majority (70%) of signals were detected by all methods (2187/3106). SPRT detected 268 signals (N < 3) that were not detected by other methods, while ROR detected 400 signals that were not detected by the other methods.

Fig. 1

Venn diagram for data from 1993–2013 to illustrate significant SDRs by the four methods and their inter-relationships (shapes not drawn to proportion). BCPNN Bayesian Confidence Propagation Neural Network, GPS Gamma Poisson Shrinker, ROR reporting odds ratio, SDRs signals of disproportionate reporting, SPRT Sequential Probability Ratio Test

Comparisons were done by reviewing the numbers of new signals based on drug–AE combinations that had some reports in the database prior to that year but were not signals previously, and totally new signals, where the combination had reports for the first time in the relevant period. Table 2 gives the number of new significant SDRs for each method for different quarters from 2012 to 2013. In general, SPRT tends to generate a higher percentage of new significant pairs compared to the other methods.
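The distinction between the two kinds of new signal can be sketched with set operations. The function name, argument names and drug–AE pairs below are hypothetical placeholders, and the snippet assumes each snapshot of the database has already been reduced to a set of signalled pairs and a set of pairs with at least one report.

```python
def classify_new_signals(signals_before, signals_now, reported_before):
    """Split newly signalled pairs into two groups.

    previously_reported: pairs with earlier reports that were not yet signals
    first_time: pairs reported for the first time in the current period
    """
    new = signals_now - signals_before
    previously_reported = new & reported_before
    first_time = new - reported_before
    return previously_reported, first_time

# Hypothetical pairs, for illustration only.
signals_before = {("drug A", "rash")}
reported_before = {("drug A", "rash"), ("drug B", "nausea")}
signals_now = {("drug A", "rash"), ("drug B", "nausea"), ("drug C", "vertigo")}

previously_reported, first_time = classify_new_signals(
    signals_before, signals_now, reported_before
)
print(previously_reported, first_time)
```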

Table 2 Number of new SDRs generated by each method, by quarter (2012–2013)

To evaluate which methods performed better than the others, we reviewed the significant SDRs in terms of seriousness of AE, whether the AE is labelled or not labelled, PPV, NPV, sensitivity and specificity. In this analysis, the SPRT method detected more not labelled drug–AE pairs. In terms of PPV, ROR, BCPNN and GPS performed better than SPRT. In terms of sensitivity, ROR performed better than other methods. The performances of the methods were similar for NPV and specificity (see Table 3).

Table 3 Comparisons of methods in terms of seriousness of AE, whether the AE is labelled or not labelled, PPV, NPV, sensitivity and specificity (1993–2013)

4 Discussion

We have shown that the SPRT method has some different properties to the other methods and that it can be ‘tuned’ to detect signals for rare events as well as more frequent ones. The fact that the hRR has to be pre-specified, while apparently a disadvantage, can be utilised to obtain signals in different circumstances. This method may be suitable for databases with smaller total numbers of reports and where a signal would be detected even with smaller numbers, compared with databases containing many millions of reports, such as those of the FDA and the EU.

More work may need to be done to investigate its use in practice and whether it should be an additional or an alternative method for use in the context of smaller databases. There is no general ‘gold standard’ to define which of the drug–AE pairs are really true ADRs and which are not. Methods like ROR and GPS that have been used in the past may have led to labelling, but it is not certain that all such associations are true ADRs. Individual regulatory authorities may need to examine the characteristics of the signals detected and not detected by the different methods in their own data.

Singapore, although having a high reporting rate based on number of AE reports received per million inhabitants, is a small country and, therefore, its total number of reports is not that high. In this situation, there are drug–AE pairs of interest with small numbers of reports, and detecting SDRs using SPRT only with a small value of hRR will be problematic.

For more rare events, it could be useful to adopt a higher hRR for early signalling purposes. For example, dabigatran (anticoagulant) and cerebral infarction was signalled earlier by hRR = 4.1 when the number of observed counts was two. Using hRR = 2 only gave a signal 9 months later when the number of observed counts reached five. ROR, BCPNN and GPS signalled it 3 months later than SPRT. Diltiazem (antihypertensive/anti-anginal) and vestibular disorders was signalled with hRR = 4.1 in 2011, but not by hRR = 2, ROR, BCPNN or GPS. Diltiazem and vestibular disorders is not included in the product label, but studies have shown migraine-related dizziness or vertigo have been reported in 7% of patients [28]. Letrozole (anticancer) and epidermal necrolysis was signalled with hRR = 4.1, ROR, BCPNN and GPS in 2011, but not by hRR = 2. Letrozole and epidermal necrolysis is included in the product labels as either uncommon or rare. Vancomycin (antibiotic) and acute generalised exanthematous pustulosis was signalled using hRR = 4.1, ROR and GPS in 2011, and 6 months later by BCPNN. These findings suggest that SPRT could have a useful role, but it is not clearly superior to other methods.
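The earlier signalling by the higher hRR can be illustrated by asking how many reports are needed before a pair signals. The sketch below assumes a Poisson log-likelihood ratio, a Wald upper boundary with illustrative error rates `alpha = 0.05` and `beta = 0.20`, and a fixed small expected count; none of these values is stated in this section, and the function name is hypothetical.

```python
import math

def min_signalling_count(expected, hrr, alpha=0.05, beta=0.20):
    """Smallest observed count O with O*ln(hrr) - expected*(hrr - 1) at or
    above the assumed upper boundary ln((1 - beta) / alpha)."""
    upper = math.log((1.0 - beta) / alpha)
    needed = (upper + expected * (hrr - 1.0)) / math.log(hrr)
    return max(1, math.ceil(needed))

# Hypothetical expected count for a rare drug-AE pair.
count_hrr_high = min_signalling_count(expected=0.01, hrr=4.1)
count_hrr_low = min_signalling_count(expected=0.01, hrr=2.0)
print(count_hrr_high, count_hrr_low)
```

Under these assumptions a pair with a very low expected count needs two reports to signal at hRR = 4.1 but five at hRR = 2. Holding the expected count fixed is a simplification, since in practice it changes as reports accumulate.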

There are some signals probably resulting from confounding by indication, such as clozapine (antipsychotic) and neurosis, dapsone-pyrimethamine (combination of antibiotic and antimalarial) and infection, dasatinib (anticancer) and malignant neoplasm, hepatitis B immunoglobulin and viral hepatitis, pentamidine (antimicrobial) and pneumonia, and rivaroxaban (anticoagulant) and melaena. They may also be markers of the drug being ineffective, but deciding which is true is difficult, if not impossible, from spontaneous reports.

While the SPRT method is intended to allow for multiple looks at accumulating data, it does not explicitly address other issues of multiplicity. There are over 20,000 drug–AE pairs that are tested, and none of the methods make explicit allowance for this form of multiplicity. Here, reports are of suspected ADRs, so the possibility that they are all chance effects is not tenable, and the application of Bonferroni types of correction would be too extreme and lead to a notable loss of power. False discovery type methods, as described by Gould [12] and Ahmed et al. [29], do address these forms of multiplicity, and they could be applied to SPRT methods as well.

Therefore, it is not possible to choose the hRR to be detected based solely on statistical grounds. Furthermore, it should be noted that the SPRT is a sequential test, and applying it to an existing database is not the most appropriate approach to its evaluation. The most appropriate approach would be to look at the newly arrived data and see what SDRs are detected. It is also clear that the actual hRR that is most likely to be a real effect is of relevance. It is very likely that very high hRRs for reasonably common effects will have been detected in randomised trials used for licensing. However, extremely rare reactions would not be detected, and spontaneous AE reports are the best tool for detecting them.

Having a large value of hRR will generate SDRs at very low observed or expected counts, but at high observed or expected counts, the signals might be missed. Hence, a detailed analysis of the effect of different hRR values on the database is necessary, and we have tried having different thresholds or hRR values for different observed counts or expected counts, but this does not seem practically sensible. However, having different thresholds for different types of AEs, depending on their rarity, may well be sensible. This may not be easy to define, but is worth exploring in the future.
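One way to see why a large hRR can miss signals at high counts is to rearrange the signalling condition. The derivation below is a sketch that assumes a Poisson log-likelihood ratio and a Wald upper boundary with error rates α and β; O and E denote the observed and expected counts, and these symbols are introduced here for illustration.

```latex
\begin{align*}
\mathrm{LLR} = O \ln(\mathrm{hRR}) - E\,(\mathrm{hRR} - 1)
  &\ge \ln\frac{1-\beta}{\alpha} \\
\Longleftrightarrow\quad
\frac{O}{E}
  &\ge \frac{\mathrm{hRR} - 1}{\ln(\mathrm{hRR})}
     + \frac{1}{E \ln(\mathrm{hRR})}\,\ln\frac{1-\beta}{\alpha}
\end{align*}
```

As E grows, the second term vanishes and the required observed-to-expected ratio approaches (hRR − 1)/ln(hRR), which is about 1.44 for hRR = 2 and about 2.20 for hRR = 4.1. Under these assumptions, a common pair with an observed-to-expected ratio of about 2 would eventually signal at hRR = 2 but not at hRR = 4.1.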

5 Conclusions

It appears that SPRT may have some applications in Singapore’s SRS. For AEs that are rare and thus expected to have low expected counts, applying a higher hRR for the SPRT method may pick up SDRs of concern. On the other hand, AEs that are more common need a lower hRR to weed out false positives. To appreciate the value of SPRT in the Singapore database, more in-depth analysis comparing the value of the signals picked up by varying the hRRs would be a useful next step of investigation. Other countries, especially those with smaller databases, may find that this simple SPRT method can be applied very easily to their databases and may provide signal detection for some rare events of significance to them. Assuming they have a database and the ability to produce the counts of the pairs, it is easy to apply the method, and this could be done using any spreadsheet or statistical software.