Key Points

Internet search query methods provide an additional approach to examining a new and large data resource of internet search query data for signal detection of adverse events.

The best performance of the individual search query metrics was found with the Q-PRR and PQR metrics, which had high sensitivity and moderate specificity for detecting signals in web query data when compared with metrics used for signal detection in the FAERS database.

These new query metrics provide the ability to get faster results for the signal detection of adverse events than the current standard spontaneous adverse event reporting databases which have a long lag time in data releases.

1 Introduction

Multiple resources for signal detection exist. Commonly used spontaneous adverse event (AE) reporting systems such as the US FDA AE Reporting System (FAERS) and the World Health Organization’s VigiBase include large amounts of data, but these systems have their limitations and may miss AEs [1]. It has been estimated that spontaneous AE reporting systems may under-report up to 98% of adverse drug events that occur [2], so alternative data resources have been considered [1, 3–6].

Signal-detection algorithms (SDAs), such as disproportionality and Empirical Bayes geometric mean (EBGM) metrics, have been used as the primary tools to detect signals in spontaneous AE reporting systems such as FAERS. However, these metrics were developed for use in spontaneous AE reporting systems and may not function appropriately with other data sources. Internet search engine logs and social media data have been proposed as supplemental data resources to mine for drug safety signals; however, the methods with which to process and analyze these large data sources are still under development [5, 7, 8]. Recently, the US FDA provided some draft guidance on how to utilize internet-based data sources for drug safety signal detection in the general population [1, 9, 10].

Internet search engine logs and social media data are new data sources that provide a faster detection system with a larger scope in which to capture health-related signals than traditional surveillance methods. While these internet ‘big’ data sources are plentiful, they also have limitations: the data are very noisy, they may produce many false-positives, and their performance in detecting signals has not been well validated. Techniques implemented to date include noise reduction applied to the internet data (e.g., removing outliers, timing between drug and event, setting high thresholds) prior to analysis [5, 7] and restriction of data sources to datasets of known associations and medical professionals [11]. For example, a recent study by Odgers et al. [11] showed that search logs from healthcare professionals in a medical database can be used as a predictive data resource for detecting adverse drug reactions (ADRs) of well-established drug–event associations as well as a set of recently established ADRs. However, there is interest in examining all internet searches (as well as social media posts) to represent both healthcare provider and consumer input and capture a full picture of all potential ADRs.

The objective of this study was to assess the performance of newly proposed query metrics in detecting signals from web query log data compared with the signals detected with reference SDAs in the FAERS database.

2 Methods

2.1 Data Sources

We extracted all English language queries submitted to the Microsoft Bing search engine by users in the USA for the period 1 March 2013 to 30 September 2013. For each query, we extracted the query text, time, date, and an anonymized user identifier. To maintain user privacy, identifiers were first anonymized by hashing before the investigators gained access to them. They were then aggregated prior to analysis, and no individual-level user datum was examined by the experimenters. Data extraction and normalization processes were performed in a manner similar to that described in a previous publication [7].

Ten single-agent drugs marketed in the USA by Sanofi were identified and used in this analysis. The selected medications included drugs with different characteristics, including those from different drug classes, from different therapeutic areas (cardiovascular/diabetes, immunology, oncology, and central nervous system agents), and drugs with different administration routes (pills, injectables, etc.). Several drugs were also new to the market in the last few years, whereas others have been on the market for decades and are available as over-the-counter medications. These drugs were selected because Sanofi is the market authorization holder for these specific drugs and the authors were familiar with the drug’s safety profile from multiple internal sources. The drugs’ indications were widespread to ensure the metrics would capture signals over a widespread number of indications and events.

Health event-related terms were taken from a list of 195 symptoms from the International Statistical Classification of Diseases and Related Health Problems (ICD) and expanded with synonyms to capture nonmedical terminology used by patients in a web search (see Table S1 in the Electronic Supplementary Material [ESM]) as described previously [7]. These health-related terms were then mapped to ICD, Ninth Revision, Clinical Modification (ICD-9-CM) codes as described previously in a publication that introduced the query log reaction score (QLRS) method for signal detection. The ICD-9-CM terms were then mapped manually by investigators between the medical symptoms found in the query log data and the Medical Dictionary for Regulatory Activities (MedDRA®, v.17) preferred terms (PTs) listed in FAERS. The first step was to group the medical terms or symptoms with the same meaning or those with typographical errors (‘typos’), e.g., ‘amnesia’ or ‘can’t remember’. They were then standardized into the most relevant medical terms and encoded into the MedDRA® PTs.

Web search queries were filtered to include only those that contained one of the above-mentioned health events and one of the ten medicines identified by brand name. The query data contained 1580 possible pairs. FAERS data contained 22,103 drug–event pairs, of which 898 drug–event pair combinations overlapped with the query data (see Fig. 1).
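As an illustration only, the filtering step could be sketched as follows. The drug name `drugx`, the synonym list, and the function `match_pairs` are hypothetical, and plain substring matching is an assumption made for brevity; the normalization actually used in the study followed the process described in [7].

```python
from typing import Dict, Iterable, List, Set, Tuple


def match_pairs(
    query: str,
    drug_names: Iterable[str],
    symptom_synonyms: Dict[str, List[str]],
) -> Set[Tuple[str, str]]:
    """Return the (drug, event) pairs mentioned together in one query.

    Assumes drug names and synonyms are supplied in lower case.
    """
    text = query.lower()
    drugs = {name for name in drug_names if name in text}
    # Map lay wording back to one canonical event term.
    events = {
        canonical
        for canonical, synonyms in symptom_synonyms.items()
        if any(term in text for term in synonyms)
    }
    return {(drug, event) for drug in drugs for event in events}


# Hypothetical inputs, not taken from the study data.
DRUGS = ["drugx"]
SYNONYMS = {"amnesia": ["amnesia", "can't remember"]}

kept = match_pairs("DrugX and can't remember things", DRUGS, SYNONYMS)
dropped = match_pairs("can't remember my password", DRUGS, SYNONYMS)
```

A query is retained only when it yields at least one pair, so `dropped` is empty because it names a symptom without a drug.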

Fig. 1 Time periods covered by the US FDA adverse event reporting system (FAERS) database and search engine query log data and the amount of overlap

For validation purposes, data from the FDA spontaneous reporting database, FAERS, were used to assess whether the same signals could be detected in both data sources using SDAs. Empirica™ Signal 7.3 system (Oracle®) was used to extract and clean the raw safety data from the FAERS database. The AEs reported for the ten drugs of interest were extracted from the FAERS database within the time period 1 January 1968 to 30 September 2013; this search retrieved 22,103 records of drug–event pairs for the drugs of interest and reported health events coded by MedDRA® PTs and standardized MedDRA® narrow and algorithmic query terms. When compared with the 1580 drug–event pairs obtained from the query log data, some FAERS records matched multiple times. Thus, 22,224 records from FAERS were included in the database for the analysis. The MedDRA® terms captured in this database were then mapped to ICD-9-CM terminology via a MedDRA® to ICD-9-CM code dictionary.

2.2 Signal-Detection Algorithms

2.2.1 Metrics Applied to the US FDA Adverse Event Reporting System (FAERS) for Signal Detection

The current standard SDAs used for disproportionality analysis (DPA) in spontaneous AE reporting databases include the multi-item Gamma Poisson shrinker (MGPS) program providing the EBGM with the lower bound of the 90% confidence interval (CI) of EBGM (EB05), and the proportional reporting ratio (PRR) [12, 13].

Two commonly used measures for signal detection based on DPA in the FAERS database were selected as reference measures for our analysis (Table 1) [13–15]. First, a frequentist method commonly used is the PRR, which measures the proportion of reports of an event for drug A relative to that for all other drugs. For the PRR, a commonly used SDA consists of a PRR ≥ 2, with the number of reports (N) ≥ 3, and a PRR χ² ≥ 4 [15]. A Bayesian method, the MGPS program, uses a stratified full independence model and derives adjusted ratios of the observed versus expected counts. The MGPS program estimates an EBGM that approximates a χ² statistic: EBGM = (a/M₁)/(N₁/N), and the most commonly used signal threshold for this statistic is EB05 ≥ 2 [13]. Thresholds of EBGM ≥ 2 and EBGM ≥ 4 were also considered as evidence of a signal of disproportionate reporting in this analysis. These aforementioned SDAs and threshold cutoffs were considered the reference metrics for which to compare the performance of the query metrics as part of this validation exercise.
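The reference PRR rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: the cell labels a, b, c, and d follow the conventional 2 × 2 layout (the table body is not reproduced here), the χ² is the Pearson statistic without continuity correction, and all cell margins are assumed to be non-zero. The function `observed_to_expected` computes only the raw ratio (a/M₁)/(N₁/N); the MGPS program applies Bayesian shrinkage that this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportCounts:
    a: int  # reports with the drug of interest and the event of interest
    b: int  # reports with the drug of interest and any other event
    c: int  # reports with any other drug and the event of interest
    d: int  # reports with any other drug and any other event

    @property
    def total(self) -> int:
        return self.a + self.b + self.c + self.d


def prr(t: ReportCounts) -> float:
    """Proportion of event reports for the drug relative to all other drugs."""
    return (t.a / (t.a + t.b)) / (t.c / (t.c + t.d))


def pearson_chi2(t: ReportCounts) -> float:
    """Pearson chi-square of the 2 x 2 table (no continuity correction)."""
    numerator = t.total * (t.a * t.d - t.b * t.c) ** 2
    denominator = (t.a + t.b) * (t.c + t.d) * (t.a + t.c) * (t.b + t.d)
    return numerator / denominator


def observed_to_expected(t: ReportCounts) -> float:
    """Raw (a/M1)/(N1/N) ratio; assumed margins, no Bayesian shrinkage."""
    m1 = t.a + t.b  # all reports for the drug of interest
    n1 = t.a + t.c  # all reports for the event of interest
    return (t.a / m1) / (n1 / t.total)


def is_prr_signal(t: ReportCounts, min_prr: float = 2.0,
                  min_n: int = 3, min_chi2: float = 4.0) -> bool:
    """Apply the rule PRR >= 2, N >= 3, and chi-square >= 4."""
    return t.a >= min_n and prr(t) >= min_prr and pearson_chi2(t) >= min_chi2


# Hypothetical counts for illustration.
example = ReportCounts(a=20, b=80, c=100, d=9800)
too_few_reports = ReportCounts(a=2, b=98, c=100, d=9800)
```

With these hypothetical counts, `example` meets all three criteria, whereas `too_few_reports` fails the N ≥ 3 requirement regardless of its ratio.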

Table 1 2 × 2 table for detecting disproportionate reporting of adverse events from spontaneous reporting database

2.2.2 Metrics Applied to Search Engine Query Data

The search engine QLRS metric was the first metric compared with the reference SDA metrics.

The QLRS is computed as the χ² score of a two-by-two table (Table 2) comprising (1) the number of people who asked about an event before versus after they first asked about the drug and (2) the number of people who asked about the same event before versus after an arbitrary date. The latter are used as a normalization factor, e.g., for seasonal events, etc. Since there are no established thresholds for the QLRS metric, which equates to a χ² statistic, this study examined two cutoffs: QLRS ≥ 5 (median value for QLRS distribution) and QLRS ≥ 10. The QLRS test measures the statistical probability of a drug and event association, but it does not measure the strength of an association between the event and a drug, so an alternative metric, called the proportional query ratio (PQR) was developed. The PQR represents the proportion of queries for an event on or after day 0 (the day when that user first searched for the drug) in users searching for a drug relative to the proportion of event queries in users not searching for the drug (Table 2). For users who did not search for the drug, day 0 is the midpoint between the first and last day of their queries.
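A rough sketch of both query metrics follows. The field names are hypothetical because the table body is not reproduced here, and the PQR denominator is one possible reading of the verbal definition: within each user group, the share of event queries that fall on or after day 0. It should be treated as an assumption and not as the published formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryCounts:
    drug_before: int   # drug searchers who queried the event before day 0
    drug_after: int    # drug searchers who queried the event on/after day 0
    other_before: int  # non-drug searchers, event query before their day 0
    other_after: int   # non-drug searchers, event query on/after their day 0


def qlrs(t: QueryCounts) -> float:
    """Pearson chi-square of the 2 x 2 table (no continuity correction)."""
    n = t.drug_before + t.drug_after + t.other_before + t.other_after
    cross = t.drug_before * t.other_after - t.drug_after * t.other_before
    denominator = (
        (t.drug_before + t.drug_after)
        * (t.other_before + t.other_after)
        * (t.drug_before + t.other_before)
        * (t.drug_after + t.other_after)
    )
    return n * cross ** 2 / denominator


def pqr(t: QueryCounts) -> float:
    """Assumed reading: post-day-0 share in drug users over that in others."""
    drug_share = t.drug_after / (t.drug_before + t.drug_after)
    other_share = t.other_after / (t.other_before + t.other_after)
    return drug_share / other_share


# Hypothetical counts for illustration.
counts = QueryCounts(drug_before=30, drug_after=70,
                     other_before=500, other_after=500)
score = qlrs(counts)
ratio = pqr(counts)
```

The example shows why the two metrics complement each other: `score` indicates how unlikely the imbalance is by chance, while `ratio` expresses its size.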

Table 2 2 × 2 table for estimating disproportionate queries from web-based query log data

When the query log data were considered in a manner similar to DPA PRR (see the 2 × 2 table in Table 3), a new approach was created and named the query proportional rate ratio (Q-PRR). The Q-PRR approach discards non-drug search queries and does not count events occurring before the first day of a drug name search. The Q-PRR examines the number of queries for the drug of interest and the number of queries for the event of interest compared with the number of queries for all other events. From Table 3, the Q-PRR = (a/N₁)/(b/N₂).
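The Q-PRR calculation might be sketched as below. The record layout and names are hypothetical, and the cell assignment is an assumption made by analogy with the PRR: a and N₁ count event-of-interest and all event queries for the drug of interest, and b and N₂ count the same quantities for all other drugs.

```python
from typing import Iterable, NamedTuple, Optional


class EventQuery(NamedTuple):
    drug: str   # drug the user searched for
    event: str  # health event named in the query
    days_from_first_drug_query: int  # negative means before day 0


def q_prr(queries: Iterable[EventQuery], drug: str,
          event: str) -> Optional[float]:
    """Return (a/N1)/(b/N2), or None when a ratio is undefined."""
    a = n1 = b = n2 = 0
    for q in queries:
        if q.days_from_first_drug_query < 0:
            continue  # events before the first drug search are not counted
        is_event = q.event == event
        if q.drug == drug:
            n1 += 1
            a += int(is_event)
        else:
            n2 += 1
            b += int(is_event)
    if n1 == 0 or n2 == 0 or b == 0:
        return None
    return (a / n1) / (b / n2)


# Hypothetical log entries for illustration.
log = [
    EventQuery("drugx", "nausea", -2),  # discarded: precedes day 0
    EventQuery("drugx", "nausea", 0),
    EventQuery("drugx", "nausea", 3),
    EventQuery("drugx", "nausea", 9),
    EventQuery("drugx", "headache", 4),
    EventQuery("drugy", "nausea", 1),
    EventQuery("drugy", "headache", 2),
    EventQuery("drugy", "headache", 5),
    EventQuery("drugy", "headache", 7),
]
result = q_prr(log, drug="drugx", event="nausea")
```

Users who never searched for a drug do not appear in the input at all, which mirrors the statement that non-drug search queries are discarded.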

Table 3 New 2 × 2 table for query proportional rate ratio (Q-PRR) metric for web-based query log data

Additional factors were examined (e.g., sample size [number of drug–event pairs], exclusions, and cutoff values) to see whether they would improve the metric’s performance, and a combination logistic model of all these factors was developed. Other measures examined included the number of drug–event pairs for QLRS metric (N-QLRS) and Q-PRR (N-Q-PRR), outliers, PQR > 1, and the χ² of Q-PRR. These measures were then used together in a composite model for testing. The composite model was created with a logistic regression model that included each query test metric and the additional factors listed above (e.g., outliers, sample sizes, etc.) for a total of 11 independent variables to predict overall outcome (FAERS metric outcome).
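The idea of a composite model can be illustrated with a small, self-contained logistic regression. This sketch is not the study's model: it uses three hypothetical binary inputs instead of the 11 independent variables, invented toy data, and plain gradient descent chosen only to avoid external dependencies (the fitting procedure used in the study is not stated).

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights: Sequence[float], x: Sequence[float]) -> float:
    """weights[0] is the intercept; the rest align with the features in x."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


def fit_logistic(rows: List[List[float]], labels: List[int],
                 learning_rate: float = 0.5, epochs: int = 2000) -> List[float]:
    """Fit by batch gradient descent on the mean log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            error = predict_probability(weights, x) - y
            gradient[0] += error
            for j, value in enumerate(x):
                gradient[j + 1] += error * value
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights


# Hypothetical features per drug-event pair: [QLRS >= 5, PQR > 1, Q-PRR >= 2].
# Hypothetical label: 1 if the reference SDA flagged the pair in FAERS.
features = [[1, 1, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0], [1, 0, 0], [0, 0, 1]]
reference_signal = [1, 1, 1, 0, 0, 0]

model = fit_logistic(features, reference_signal)
p_all_flags = predict_probability(model, [1, 1, 1])
p_no_flags = predict_probability(model, [0, 0, 0])
```

The predicted probabilities can then be ranked against the reference outcome in the same way as any single query metric.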

2.3 Statistical Analysis

The sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of the query metrics were calculated to validate the method against the reference SDAs used with the FAERS database. Receiver operating characteristic (ROC) curves [16] were constructed to measure the performance of the query metrics, and the area under the curve (AUC) was estimated to show how well each metric discriminates between positive and negative signals. The C statistic, which reflects the probability of concordance, was used to calculate the AUC; it is equivalent to the area under the ROC curve and estimates the predictive accuracy of a logistic regression model [17].
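These performance measures can be sketched as follows, treating the reference SDA outcome as the truth for each drug–event pair. The function names and data are hypothetical; the C statistic is computed here as the fraction of positive-negative pairs in which the positive pair has the higher score, with ties counted as one half.

```python
from typing import Dict, Sequence


def confusion_metrics(predicted: Sequence[bool],
                      reference: Sequence[bool]) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, and NPV against the reference signal."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum(not p and r for p, r in zip(predicted, reference))
    tn = sum(not p and not r for p, r in zip(predicted, reference))

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def c_statistic(scores: Sequence[float], reference: Sequence[bool]) -> float:
    """Probability of concordance, equal to the area under the ROC curve."""
    positives = [s for s, r in zip(scores, reference) if r]
    negatives = [s for s, r in zip(scores, reference) if not r]
    if not positives or not negatives:
        return float("nan")
    concordant = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return concordant / (len(positives) * len(negatives))


# Hypothetical outcomes for eight drug-event pairs.
query_flag = [True, True, True, False, False, False, False, False]
faers_flag = [True, False, False, True, False, False, False, False]
metrics = confusion_metrics(query_flag, faers_flag)

# Hypothetical metric values for five pairs.
auc = c_statistic([0.9, 0.8, 0.3, 0.6, 0.1],
                  [True, False, True, False, False])
```

The false-positive rate reported in the results corresponds to 1 − `metrics["specificity"]`.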

A sensitivity analysis of the query data was performed by excluding outliers and drug–event pairs with small sample sizes (where fewer than 100 Bing users searched for both drug and event) to see whether this modification would improve the performance of the query metrics. Outliers were defined, according to previous work by Yom-Tov and Gabrilovich [7], as the five symptoms found using a greedy selection process to most reduce the value of the correlation between QLRS and FAERS counts.

Next, a sensitivity analysis was also performed on variations of the PRR reference SDA by increasing the minimum number of drug–event pairs needed to be considered a signal to assess its concordance with the query metrics.

3 Results

Figure 1 shows the time periods covered by both databases and their period of overlap. There were 22,103 drug–event pairs in the FAERS database and 1580 drug–event pairs in the search engine query log database. The 898 drug–event pairs found in the FAERS and the search engine query log databases were used for comparison in this analysis. Distributions of the web query metrics are shown in Fig. S1 in the Electronic Supplementary Material.

3.1 Sensitivity/Specificity Performance

The sensitivity and specificity of the query metrics on all 898 drug–event pairs were estimated with the reference SDAs in FAERS data (Tables 4, 5, 6). When all drug–event pairs were analyzed, the QLRS had a sensitivity ranging from 46.6 to 72.7% and a high false-positive rate (1−specificity), between 50.39 and 51.28% depending on which of the four SDAs were used in the FAERS database (Table 4). The sensitivity improved greatly for QLRS when the drug–event pairs with small sample sizes (n = 292 pairs) were excluded from analysis, but the false-positive rate increased dramatically to around 75% for the reference SDAs. The corresponding PPV was very poor, never achieving higher than 11.87% with any of these SDAs, but the NPV remained high, ranging from 86.00 to 99.32%. Removing outliers (N = 35) in addition to drug–event pairs with small sample sizes from the data did not measurably change the performance of the QLRS.

Table 4 Query log reaction score metric sensitivity and specificity (%)
Table 5 Proportional query ratio metric sensitivity and specificity (%)
Table 6 Query proportional rate ratio metric sensitivity and specificity (%)

The sensitivity and false-positive rate were slightly better when using the PQR than with QLRS, around 47.06–81.82 and 43.88–44.16%, respectively, when tested using all 898 drug–event pairs (Table 5). Again, the sensitivity improved from 64.81 to 80.00% when drug–event pairs with small sample sizes were excluded; however, the false-positive rate remained moderately high (57.84–58.39%). The best PPV was found for PQR against the combined metric using PRR SDA, but the PPV was still quite low, below 14%. Removing outliers also did not change the overall performance of the PQR.

The Q-PRR metric showed a stronger specificity with the reference for all drug–event pairs (66.29–67.77%; false-positive rate 32.23–33.71%) but exhibited a lower sensitivity (43.75–54.55%) than PQR (Table 6). Excluding smaller sample sizes improved the sensitivity somewhat (50.00–68.52%) but also increased the false-positive rate (42.57–57.84%). Similar to PQR, the Q-PRR had the strongest PPV and NPV values when tested with the reference combined SDA using the PRR, N, and χ² metric. Overall, the best performance for the Q-PRR was found with query data that excluded smaller samples where the NPV was estimated at 92.52% and PPV at 18.01% when tested against the reference PRR SDA.

When excluding the smaller sample sizes and outliers, the sensitivity of each of the metrics increased and, as expected, the false-positive rate of these tests also increased (specificity of tests decreased). In summary, of the query metrics tested against the reference SDAs, the Q-PRR metric had the lowest false-positive rate (32.2%; Table 6) and the QLRS had the highest false-positive rate (51.2%; Table 4) when compared with the reference PRR SDA based on all drug–event pairs.

3.2 Area Under the Curve (AUC)

The performance of each query metric compared with a reference measure was graphed and the AUC measured for all 898 drug–event pairs. Table 7 shows the AUC value for the performance of each of the query metrics against the reference metrics. For the QLRS metric, the best concordance was found with the reference of EBGM ≥ 2, and this AUC estimate did not deviate much from random expectations at 0.536.

Table 7 Performance area under the curve of the query metrics and reference for adverse event reporting system

The PQR metric performed best with the EBGM SDA (EBGM ≥ 4) and resulted in a higher AUC estimate of 0.652 (Fig. 2). Lastly, the Q-PRR metric had an AUC value (0.652) similar to the PQR with the EBGM ≥ 4 cutoff (Fig. 3). While there was no statistically significant difference between the PQR and Q-PRR AUCs, both of these metrics were significantly different from QLRS AUC (p = 0.05, two-tailed test) when compared with the signals found with EBGM ≥ 4 SDA.

Fig. 2 Receiver operating characteristic curve for PQR metric with EBGM ≥ 4 signal-detection algorithm

Fig. 3 Receiver operating characteristic curve for Q-PRR metric with EBGM ≥ 4 signal-detection algorithm

When determining the performance of the metrics using AUC curves, the Q-PRR (and PQR) had the highest value (0.65) when using the full dataset of all drug–event pairs (EBGM ≥ 4 cutoff). However, the metric with the highest PPV and NPV is the Q-PRR query metric when using the reference PRR SDA cutoff for FAERS data.

3.3 Additional Modifications to Maximize Performance

Sensitivity testing with the reference PRR SDA was also performed to assess its effect on the ROC AUC of the tested query metrics. Because the reference PRR SDA had the lowest false-positive rate for the query metrics, the AUC values were evaluated by increasing the value of N (number of reports for a drug–event pair), basically varying the threshold for the reference. The AUC results for the query metrics against different N values of the reference PRR SDA are shown in Table 8. Increasing the size of N in the PRR gold standard variable increased the AUC value for some of the query metrics. In particular, the Q-PRR metric had the highest AUC value (0.614) when the PRR threshold required that N be ≥ 200 reports.

Table 8 Sensitivity analysis of increasing the sample size threshold for proportional reporting ratio reference metric

When all the metrics and measures were combined into a composite model of metrics (CMM) to measure performance, the AUC increased for most of the reference SDAs (Table 7). Again, the performance of the CMM had strongest agreement with the EBGM ≥ 4 SDA, with an AUC of 0.821 (Fig. 4). The performance of the CMM improved from 0.558 to 0.758 when changing the various reference threshold cutoffs for the PRR metric (Table 8).

Fig. 4 Receiver operating characteristic curve for composite model of 11 metrics by EBGM ≥ 4 signal-detection algorithm

4 Discussion

The best performance of the individual search query metrics was found with the Q-PRR and PQR metrics when testing against the EBGM ≥ 4 as the SDA for FAERS. The PQR test had the highest sensitivity and highest AUC value for the internet search query metrics when using the EBGM ≥ 4 as the reference. The Q-PRR had the highest specificity (lowest false-positive rate) as well as the highest AUC value for the search query metrics. However, these metrics were not very specific for any of the reference metrics, as they still had many false-positives.

When all the query metrics were put together into a composite model, the agreement with each of the reference SDAs improved. The composite model had the strongest agreement with the EBGM ≥ 4 SDA, with an AUC of 0.82. This demonstrates that the addition of each metric to a composite model improved the predictability with the reference over each metric individually. It also suggests that a surveillance system may require monitoring of multiple metrics for optimal detection.

The high false-positive rate may represent false signals (creating noise) or, alternatively, AEs that are under-reported (or unreported) in the FAERS database. Yom-Tov and Gabrilovich [7] suggested that under-reporting may be the more prominent of the two, as late-appearing less acute AEs can be detected using internet search query log data while being under-reported in FAERS. Therefore, additional investigation is needed to form more encompassing gold standards than FAERS.

Here, we focused on developing new measures to screen the web query data for potential signals using the FAERS database as the gold standard. While a spontaneous reporting system such as FAERS is not a true gold standard, it is a standard used by health authorities and has very practical applications in pharmacovigilance practices commonly used within industry.

The Q-PRR test had the lowest false-positive rate, at around 33%, across all the FAERS reference SDAs when tested among all the drug–event pairs. The lowest Q-PRR false-positive rate was obtained using the reference PRR SDA (32.2%); this may be because the Q-PRR has an algorithm design similar to that of the PRR measure used in this reference. The PQR test had a higher false-positive rate, at around 44%, suggesting that the best overall performing individual metric for any of the reference SDAs was the Q-PRR test. The CMM test exhibited the strongest agreement with the EBGM metrics, particularly when EBGM ≥ 4 was the reference.

These query log methods appear to capture signals similar to those captured by the reference PRR SDA and yet may also find additional statistical signals that may or may not be true safety signals. Limitations to these search query methods for signal detection include that a drug and event searched for by the same user does not necessarily equate to an AE occurring in an individual. Media attention, patients reading drug labels, and numerous other reasons may cause individuals to search a drug name and a symptom/AE at the same time. Despite these limitations, a good measure of signal detection with known reliability and validity is needed to find true safety signals from search engine data because a high volume of internet data can be mined for safety signals when these data are used in the appropriate context [16]. This analysis is one way to enhance signal detection of AEs in a manner faster than is currently available with the spontaneous AE reporting system, FAERS. That said, query log data contain high levels of noise, and the best methods to reduce the noise and enhance searches for true safety signals are still being explored [5, 8, 18]. Any signal identified by SDAs using DPA requires further medical analysis for signal validation to determine the true safety signals warranting a thorough evaluation to assess whether they represent a risk for patients receiving the drug (e.g., qualitative analysis of case narratives). Further work is also needed to validate the predictive ability of internet search query log-derived signals for safety events [5].

At least one study that compared the performance of the reference SDAs in other healthcare databases with that in spontaneous-reporting databases found a similarly poor performance of the SDAs, suggesting these disproportionality methods did not discriminate true positives from true negatives in the observational healthcare databases and simulated datasets as they reportedly do with spontaneous-reporting data [19]. A recent study by Odgers et al. [11] showed that web search logs from healthcare professionals using a medical database can be used as a predictive data resource to detect ADRs with well-established drug–event associations as well as a set of recently established ADRs. This work utilized two different reference standards to validate the methods but was also limited to data provided by healthcare professionals rather than searches conducted by the general public.

Other limitations known to exist with spontaneous event-reporting databases such as the FAERS database are that these databases need extensive pre-processing prior to analysis to remove duplicate reports, correct terminology errors, collate event terms and drug names into relevant categories, and remove or adjudicate cases with missing data [20]. The performance of the current standard SDAs for FAERS and other spontaneous-reporting databases has been shown to vary by prevalence of AE, medical terminology classifications, how long the drug has been marketed, and the type of AE database being examined [12, 14, 21].

It has been suggested that spontaneous-reporting databases under-represent the true number of AEs that occur in the population [2]. The web query log data may capture additional AEs that are not reported to federal regulatory bodies, and this data source could be considered an augmented approach to the FAERS database for signal detection. Testing the performance of any new method when compared with spontaneous-reporting databases such as FAERS as the reference will remain a challenge. Alternate gold standards may need to be identified. While the generalizability of these query metrics is limited, we think this approach contributes significantly to the developing field of web-based signal detection where reliable methodologies have yet to be established.

5 Conclusions

Internet search query methods may provide an additional approach to examining a new and large data resource for signal detection of AEs. Of the three individual metrics analyzed, the PQR and Q-PRR metrics performed the best; however, they still generate a moderate level of false-positives compared with the reference SDAs. A composite model of the query metrics had the best performance overall, yet further testing needs to be considered prior to implementation in signal detection. These query-based metrics were developed primarily for hypothesis generation and provide a supplemental method to examine alternative data sources (e.g., web searches, social media) in pharmacovigilance. It must also be remembered that signals in social media and internet search data may not be true safety signals, being instead due to circumstantial events within the media (e.g., increased news media coverage of a health risk). Further enhancements will be needed to provide a validated and reliable tool for signal detection in the future.