Introduction

Adverse drug effects pose a significant health and financial problem worldwide. The World Health Organization (WHO) defines “pharmacovigilance” as “the science and activities relating to detection, assessment, understanding, and prevention of adverse effects or any other drug related problems”. Some adverse effects are detected during clinical trials, but others are detected only after the drugs come to market. Considerable research and effort in pharmacovigilance is dedicated to adverse drug effect signal detection. Most often, spontaneous reporting systems (SRSs) such as FAERS are used for detecting signals with statistical and data mining algorithms [1]. Adverse drug effects can also be detected in bibliographic databases such as MEDLINE [2]. Electronic patient records are another resource for detection of adverse drug effects [3]. Sometimes, combined signals from various sources can be used for adverse drug effect detection [4]. Recently, social media, such as medical message boards [5] and Twitter [6], have been used for adverse drug effect detection.

In contrast to the majority of other pharmacovigilance methods, whose goal is to detect drug safety signals, our goal is to provide an explanation for known adverse drug effects. More specifically, our goal is to provide a pharmacological and/or pharmacogenomic explanation by finding genes or proteins that link the drug to the observed adverse effect. Our basic assumption is that a drug affects certain genes or proteins and that these genes or proteins are associated with the observed adverse effects.

Methods

We use Literature-based Discovery (LBD) [7] to find explanations for (drug, adverse effect) pairs. The goal of LBD is to generate novel hypotheses by analyzing the literature and optionally other knowledge sources. LBD uses either of two basic approaches: open discovery and closed discovery; both are based on a paradigm of three related concepts: X, Y, and Z. In open discovery only the starting concept is known. For example, if we want to find a new treatment for a given disease (X), we first try to find (patho)physiological characteristics (Y) of the disease and then seek drugs (Z) that can deal with these characteristics. In closed discovery both the starting concept (X) and the end concept (Z) are known, and we want to find intermediate, linking concepts (Y) that may help explain the relationship between X and Z. In any case, LBD is meant as a discovery support paradigm. LBD generates hypotheses, but a knowledgeable human expert is needed for the interpretation of these hypotheses [8]. Our methodology is meant to assist an experienced pharmacovigilance expert.

For the current study closed discovery is better suited because we work with known adverse effects. In other words, the starting concept (Drug_X) is known as well as the end concept (Adverse_effect_Z), and we want to find Genes_or_proteins_Y that somehow link the drug with the adverse effects. By finding the linking genes or proteins, we provide an explanation for an association found statistically.
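The closed discovery step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the SemBT implementation: the triples, the predicate names, and the function `closed_discovery` are hypothetical, and relations are assumed to be simple (subject, predicate, object) tuples.

```python
from collections import defaultdict

# Hypothetical toy relations in (subject, predicate, object) form.
# These triples are illustrative only and are not SemBT output.
RELATIONS = [
    ("Drug_X", "INHIBITS", "Gene_A"),
    ("Drug_X", "STIMULATES", "Gene_B"),
    ("Gene_A", "ASSOCIATED_WITH", "Adverse_effect_Z"),
    ("Gene_C", "CAUSES", "Adverse_effect_Z"),
]


def closed_discovery(relations, x, z):
    """Return the concepts Y that have at least one X-Y and one Y-Z relation."""
    from_x = defaultdict(set)  # Y -> predicates linking X to Y
    to_z = defaultdict(set)    # Y -> predicates linking Y to Z
    for subj, pred, obj in relations:
        if subj == x:
            from_x[obj].add(pred)
        if obj == z:
            to_z[subj].add(pred)
    # Only concepts reachable from X that also reach Z count as linking Ys.
    linking = sorted(set(from_x) & set(to_z))
    return {y: (sorted(from_x[y]), sorted(to_z[y])) for y in linking}


links = closed_discovery(RELATIONS, "Drug_X", "Adverse_effect_Z")
print(links)
```

In this toy input only Gene_A is returned, because Gene_B has no relation to the adverse effect and Gene_C has no relation to the drug.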

For this research we used the closed discovery component of an LBD tool called SemBT [9, 10], available at [11]. SemBT uses semantic relations extracted with the SemRep [12] natural language processing system from all of MEDLINE.

Results

The SemBT version used for this study is based on semantic relations extracted with SemRep from 44,250,865 sentences. These sentences come from 23,657,386 MEDLINE citations (the entire MEDLINE database up to the end of March 2014). The extraction yielded 69,331,058 semantic relation instances, corresponding to 15,175,993 distinct semantic relations.

Statistical evaluation

To evaluate our methodology, we selected 51 true positive and 29 true negative (drug, adverse effect) pairs that were curated by pharmacovigilance experts. All 29 true negatives and 28 of the true positives came from the EU-ADR project [2] because it is a well-established benchmark used in several recent pharmacovigilance papers. The additional 23 true positive pairs were added by pharmacovigilance experts because they believed that these pairs likely had a pharmacogenomic explanation. For each pair, we created a ranked list of linking Y genes or proteins using SemBT.

For the group of 51 true positive pairs, we found a total of 1523 linking Y genes or proteins, giving 29.86 Ys per true positive pair. For the group of 29 true negative pairs, we found a total of 392 linking Ys, giving 13.52 Ys per true negative pair. The nonparametric Mann–Whitney test for comparison of two independent samples was used to compare the number of Ys found in the two groups. The difference between the groups was significant (p = 0.00975), with the true positive group having higher values than the true negative group. Our method therefore finds considerably more Ys per (drug, adverse effect) pair for the true positive pairs than for the true negative pairs. We take this as an indication that our basic idea is valid, i.e. that adverse drug effects can be explained through the genes and/or proteins that link the drug to the adverse effect.
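The per-pair averages above follow directly from the reported totals, and the statistic behind the Mann–Whitney comparison can be illustrated on small samples. The sketch below is a minimal illustration: the function names and the toy per-pair counts are hypothetical (the per-pair counts of the study are not listed here), and no p-value is computed.

```python
def mean_per_pair(total_links, n_pairs):
    """Average number of linking Ys per (drug, adverse effect) pair."""
    return total_links / n_pairs


def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs with a > b count 1, ties count 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Averages recomputed from the totals reported in the text.
tp_mean = round(mean_per_pair(1523, 51), 2)
tn_mean = round(mean_per_pair(392, 29), 2)

# Hypothetical toy counts of linking Ys per pair, NOT the study data.
toy_tp = [30, 12, 45, 8]
toy_tn = [5, 12, 3]
u_tp = mann_whitney_u(toy_tp, toy_tn)

print(tp_mean, tn_mean, u_tp)
```

A U value close to the product of the two sample sizes means that values in the first sample tend to exceed those in the second. In practice the p-value would come from a statistics package, for example `scipy.stats.mannwhitneyu`, rather than from this hand-written statistic.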

Potentially new adverse drug effect explanations

For each true positive (drug, adverse effect) pair, a ranked list of linking genes or proteins was produced and given to a pharmacologist for expert evaluation. The linking genes or proteins (Ys) were ranked by the sum of the number of distinct relations between the drug (X) and the Y and the number of distinct relations between the Y and the adverse effect (Z). The pharmacologist found that in the majority of cases, the adverse effect was due to the drug’s primary pharmacological effect, i.e. the drug’s major mechanism of action, as was expected. However, he found a considerable number of cases where the adverse effect was not caused by the major drug action and which therefore represented potentially novel ways to explain the adverse drug effect. Some of these cases are shown in Table 1.
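The ranking criterion can be written down compactly. The sketch below is an illustration under assumptions: a distinct relation is taken to be a distinct predicate between two concepts, repeated instances of the same relation are counted once, ties are broken alphabetically, and the function name and toy triples are hypothetical rather than taken from SemBT.

```python
def rank_linking_concepts(relations, x, z):
    """Rank Ys by (# distinct X-Y relations) + (# distinct Y-Z relations)."""
    xy = {}  # Y -> set of distinct predicates between X and Y
    yz = {}  # Y -> set of distinct predicates between Y and Z
    for subj, pred, obj in relations:
        if subj == x and obj != z:
            xy.setdefault(obj, set()).add(pred)
        if obj == z and subj != x:
            yz.setdefault(subj, set()).add(pred)
    # Score only the Ys that are linked to both X and Z.
    scores = {y: len(xy[y]) + len(yz[y]) for y in xy.keys() & yz.keys()}
    # Highest score first; alphabetical order breaks ties (an assumption).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy triples, including one repeated instance.
toy_relations = [
    ("Drug_X", "INHIBITS", "Gene_A"),
    ("Drug_X", "INHIBITS", "Gene_A"),
    ("Drug_X", "INTERACTS_WITH", "Gene_A"),
    ("Gene_A", "ASSOCIATED_WITH", "Adverse_effect_Z"),
    ("Drug_X", "STIMULATES", "Gene_B"),
    ("Gene_B", "CAUSES", "Adverse_effect_Z"),
    ("Gene_C", "CAUSES", "Adverse_effect_Z"),
]

ranking = rank_linking_concepts(toy_relations, "Drug_X", "Adverse_effect_Z")
print(ranking)
```

Here Gene_A scores 3 (two distinct relations with the drug, one with the adverse effect) and Gene_B scores 2, while Gene_C is excluded because it has no relation to the drug.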

Table 1 Providing explanations for reported drug adverse effects through linking genes or proteins

The examples in the table are explained in more detail below.

Azathioprine

Azathioprine is an immunosuppressive drug that is metabolized to 6-mercaptopurine, a purine analogue that inhibits DNA synthesis by inhibiting the enzyme hypoxanthine-guanine phosphoribosyltransferase (HGPRT). This leads to a cytotoxic effect in dividing cells; therefore, some of the reported adverse side effects, such as leucopenia, cytopenia, myelosuppression, and anemia, can be explained by the main mechanism of action. However, here we use a novel LBD approach to identify protein targets that explain other reported adverse side effects, in particular acute pancreatitis and hepatotoxicity.

To provide a new hypothesis for the mechanism of azathioprine-induced acute pancreatitis, we identified pancreatic lipase and glutathione S-transferase as protein targets, as shown in Table 1. Application of azathioprine can lead to an asymptomatic increase in pancreatic enzymes, such as lipase and amylase [13]. Indeed, the onset of acute pancreatitis is positively correlated with abnormally high levels of pancreatic enzymes, e.g. pancreatic lipase and amylase [14]. Importantly, pancreatic lipase is the key enzyme in the development of acute pancreatitis because it releases membrane-toxic fatty acids [15]. Moreover, azathioprine is a competitive inhibitor of glutathione S-transferase [16], and can thus lead to glutathione (GSH) depletion. Since GSH is an important intracellular antioxidant, this leads to increased cellular oxidative stress. Indeed, GSH depletion is correlated with pancreatitis [17]. Furthermore, hepatotoxicity is also correlated with GSH depletion [18], a link that was also detected by our SemBT software.

Irinotecan

Irinotecan is bioactivated by carboxylesterases to SN-38, an inhibitor of topoisomerase I, and thus leads to the inhibition of both DNA replication and transcription in dividing cells. Thus, some of the reported adverse side effects, such as myelosuppression, neutropenia, and cytopenia, can be explained directly by the cytotoxic action (the main mechanism of drug action) on dividing immune cells [19]. However, to explain other common adverse reactions, such as diarrhea, we applied literature-based discovery to identify target proteins, as presented with semantic relations in Table 1. To explain diarrhea, we identified uridine diphosphate glucuronosyltransferase 1A1 (UGT1A1) as a target protein. UGT1A1 is involved in the inactivation of the bioactive molecule SN-38 by glucuronidation [19]. Indeed, patients bearing certain gene polymorphisms of UGT1A1 have a higher risk of severe neutropenia and diarrhea [20].

Atorvastatin, simvastatin, pravastatin

Although statins are well tolerated in most patients, around 7–29 % of them have statin-associated muscle symptoms [21], which are now recognized as a clinically significant complication of statin therapy. There is a knowledge gap in understanding the mechanism of statin-induced rhabdomyolysis, and even more so in its treatment. We therefore used the literature-based discovery approach to identify target proteins that might explain these statin-associated muscle side effects. We used atorvastatin, simvastatin, and pravastatin as representative statins, and identified the SLCO1B1 gene encoding the OATP1B1 protein, Carnitine O-Palmitoyltransferase, and Cytochrome P450 3A4 as targets involved in statin-induced rhabdomyolysis. The semantic relations identified are presented in Table 1.

The first target was OATP1B1, which belongs to the family of solute carrier organic anion transporters and is an influx membrane transporter responsible for the uptake of statins into hepatocytes. Changes in its activity, either by drug-drug interactions or by SLCO1B1 gene polymorphism, can affect the pharmacokinetics of statins [22]. For example, inhibition or lower activity can lead to increased bioavailability (higher plasma concentrations of statins), and thus to adverse reactions such as rhabdomyolysis. The second target identified was Carnitine O-Palmitoyltransferase (CPT), which is a mitochondrial transferase enzyme involved in the metabolism of palmitoylcarnitine into palmitoyl-CoA. Abnormal regulation of CPT can cause rhabdomyolysis [23]. Importantly, statins can interfere with CPT activity, e.g. in one study atorvastatin increased the expression of CPT [24]. Moreover, CPT deficiency often also causes non-exercise-induced rhabdomyolysis [25]. The third target identified was Cytochrome P450 3A4 (CYP3A4), which is one of the most important enzymes involved in drug metabolism. Importantly, statins are metabolized by CYP3A4, and they also inhibit its activity [26]. Therefore, concomitant administration of statin therapy and drugs that inhibit CYP3A4 increases the risk of rhabdomyolysis [27].

Semantic relation extraction evaluation

The quality of the explanations for the drug adverse effects provided by our approach largely depends on the quality of the semantic relation extraction process. Therefore, we conducted an evaluation to estimate the accuracy of the semantic processing. The evaluation was conducted at the semantic relation instance level. In other words, the goal was to determine whether a particular semantic relation was correctly extracted from a particular sentence. Eighty subjects, students in the final year of medical school (Faculty of Medicine, University of Maribor), received intensive training and detailed instructions on how to evaluate before conducting the evaluation. Subjects were organized in such a way that three of them independently evaluated the same semantic relation instance. However, subjects could choose which of their assigned relations to evaluate and which to skip. As a result, most, but not all, of the instances were evaluated by three subjects.

The semantic relation instances evaluated were a subset of those relevant to the true positive and true negative adverse drug effects mentioned before. In total, 4069 semantic relation instances were evaluated 10,279 times. The instances were evaluated as correct 8646 times (84 %) and as incorrect 1633 times (16 %). Of the distinct instances, 3795 (93 %) were evaluated as correct at least once and 1068 (26 %) were evaluated as incorrect at least once. If we do not take into account the number of persons who evaluated a particular relation instance, 3369 (82 %) distinct instances were evaluated more frequently as correct than as incorrect, 442 (11 %) instances were evaluated more often as incorrect than as correct, and 258 (7 %) relation instances were evaluated as correct exactly as many times as they were evaluated as incorrect. However, if we consider only the relation instances evaluated by exactly three evaluators (N = 1500), then 1321 (88 %) relation instances were evaluated more times as correct than as incorrect, and 179 (12 %) instances were evaluated more times as incorrect than as correct. Within this subset, 1062 instances (71 %) were always evaluated as correct and 45 instances (3 %) were always evaluated as incorrect.
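The aggregation behind these figures can be illustrated with a small majority-vote tally. This is a sketch under assumptions: each judgment is represented as a boolean (True for correct), the function names and the toy judgments are hypothetical, and only counts stated in the text are used for the percentage checks.

```python
from collections import Counter


def majority_label(votes):
    """Classify one relation instance from its list of boolean judgments."""
    correct = sum(votes)
    incorrect = len(votes) - correct
    if correct > incorrect:
        return "correct"
    if incorrect > correct:
        return "incorrect"
    return "tie"


def summarize(judgments):
    """Count instances per majority label; judgments maps id -> votes."""
    return Counter(majority_label(votes) for votes in judgments.values())


def percent(part, whole):
    """Share of part in whole, rounded to a whole percent."""
    return round(100 * part / whole)


# Hypothetical toy judgments, NOT the study data.
toy_judgments = {
    "r1": [True, True, False],
    "r2": [False, False, True],
    "r3": [True, False],
    "r4": [True],
}
summary = summarize(toy_judgments)

# Percentages recomputed from counts reported in the text.
judgment_share = percent(8646, 10279)    # judgments marked correct
three_rater_share = percent(1321, 1500)  # majority-correct, three raters

print(summary, judgment_share, three_rater_share)
```

Ties can only arise when an instance received an even number of judgments, which is why no ties appear in the subset evaluated by exactly three evaluators.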

Conclusions

We presented a tool and a methodology for finding pharmacological and/or pharmacogenomic explanations for known adverse drug effects through genes or proteins that link the drugs to the adverse effects. We found several potentially novel explanations for adverse effects that cannot be explained by the drug’s major mechanism of action.