FormalPara Key Points

Regulatory adverse event reports can be used by the US Food and Drug Administration to identify post-market drug safety trends; however, this requires intensive manual review.

Machine learning in combination with natural language processing techniques can be used to classify textual report data based on manually annotated training data sets.

This work successfully demonstrated a proof-of-concept machine learning approach to automatically detect adverse events in several textual regulatory data sets to support post-market regulatory activities.

1 Introduction

The US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) is a database that contains adverse event (AE) reports associated with marketed drugs and supports the FDA’s post-market drug safety efforts (https://open.fda.gov/data/faers/). FAERS receives safety reports voluntarily from drug consumers and health care providers, while manufacturers are required by law to submit safety reports they receive from the public [1]. Despite rigorous clinical trial requirements, new and unknown safety issues may arise in post-market phases for drugs. This can be primarily attributed to the challenges associated with perfectly mimicking post-market conditions during clinical trials [2]. FAERS can help identify previously unknown safety issues and drug–AE associations. While data are plentiful with more than a million reports filed each year, identifying these associations is laborious because the reports require intense manual review [3].

In each report, a summary of the AE that occurred is contained in a patient narrative. The patient narrative contains a free-text description of the event and is coded with preferred terms (PTs) from the Medical Dictionary for Regulatory Activities (MedDRA). By utilizing standardized codes for labeling AE reports, the FDA can identify trends regarding potential safety threats and causal relationships between drugs and AEs [4]. This information could also aid in identifying at-risk patient subpopulations, tracking inappropriate prescription trends, and facilitating continued surveillance over time [5]. The coding is performed manually and is therefore highly labor intensive [6, 7]. Providing automated and standardized support for the labeling of patient narratives could improve reviewer efficiency and is of significant interest to the FDA [3, 8].

In addition to AE narratives, MedDRA has also been applied to structured drug product labels (SPLs). SPLs facilitate distribution of information regarding marketed medical products in a standard format for use in health information systems [9]. Currently, manufacturers are not required to describe AEs using MedDRA terminology in SPLs [3]. Therefore, AE terms must be manually extracted from product labels to provide useful data. The process of manually annotating product labels using preferred MedDRA terminology is laborious [8]. Methods for automating the extraction of AEs from drug labels and mapping these events to MedDRA terminology could be beneficial for regulatory bodies.

Automated detection of AEs from text is an active area of research in this field, with experts exploring many techniques. The simplest approaches have relied on lexicon matching [3, 10,11,12,13,14,15,16] and rule-based systems [3, 6, 9, 15, 17,18,19,20]. While generally reliable and simple to apply, these approaches may have difficulty managing informal language and deciphering complex linguistic relationships. Assorted predictive statistical models (e.g., regression, support vector machine [SVM], decision trees) have been used as well [6, 11, 16, 18, 21,22,23,24,25]. Text and data from many sources have been explored for detection of AEs, including drug labels [3, 9, 10], social media posts [11,12,13, 21], biomedical literature [14, 15, 22,23,24,25,26], web search logs [27], health records [16,17,18,19, 28, 29], and regulatory reports [6, 15, 20, 26]. For a more in-depth review of recent literature, see [30].

Deep learning models have been proposed as a means for automated extraction of AEs from text data. While less interpretable and more computationally expensive than other predictive models, neural networks can provide classifications based on complex contextual associations between features. Deep neural networks, typically in conjunction with word embeddings such as GloVe [31], have proven useful for the classification of text associated with adverse drug events [6, 14, 22, 23, 25]. Recent advances have made more complex models available as well. Convolutional neural networks (CNNs) learn contextual information about features by including convolutional and pooling layers in their architecture. While most commonly applied to machine vision tasks, CNNs have been used in research for text classification and AE detection [16]. Recurrent neural networks (RNNs), which are designed to make predictions based on arbitrary-length sequences of input, have been used as well. Because the information conveyed by text is highly dependent on word order, RNNs are well suited for text classification. Recent implementations for drug AE detection include the use of bidirectional RNNs [12, 13, 19, 24], RNNs with an attention mechanism [16, 29], and an RNN–Conditional Random Field ensemble [28].

For this work, the researchers propose an RNN for the classification of patient narratives using MedDRA PTs. This proof-of-concept model demonstrates the suitability of RNNs for predicting MedDRA PTs from FAERS patient narratives. The researchers also tested the model on FDA SPLs to demonstrate its utility with more structured text data. The focus of this research is on the suitability of the data sets for predictive modeling, not on demonstrating the limits of state-of-the-art machine learning algorithms. Likewise, the objective was not to benchmark the prediction results for each data set against one another. This proof-of-concept will serve as a building block for additional projects aimed at aiding narrative coders in standardizing FDA AE data.

2 Methods

In this work, two types of data common in FDA regulatory activities were used to develop and validate a proof-of-concept application of an RNN model to extract PTs. The objective was to use the full documents contained in each data set to predict PT labels. First, the researchers performed training and cross-validation using patient narratives from the FAERS database. Second, a new RNN was fit and cross-validated on a data set of SPLs. Performance for these efforts is reported.

2.1 FAERS Data Set

The FAERS data set used for this project contained 325 event entries that included both patient narratives and PT labels. Each event can have many PT labels associated with it. For example, the following event narrative was contained in the data set:

“A one-year-old female (born in [RED]) experienced an encephalopathy (no etiology for the encephalopathy could be found) with severe hypotonia and epilepsy (exact start date not reported). The epilepsy is partly controlled with valproate sodium, clonazepam and vigabatrin treatment. Her father had been treated with omeprazole for several years (exact start date, dosage and indication not reported, his medical history included a nephrotic syndrome [sic] for which he had been treated for more than two years before the conception). Concomitant medication not précised [sic]. The reporting physician considered this to be a congenital anomaly.”

The PTs associated with this event narrative were ‘epilepsy,’ ‘encephalopathy,’ ‘hypotonia,’ and ‘congenital anomaly.’ These PTs are highlighted in the above excerpt. Note that PTs may not exist in a document in the exact preferred form. The model must therefore be able to detect synonyms of PTs, or collections of words that describe the PT. In all, there were 618 unique PTs in the data set. Many of these PTs occurred sparsely, with most of them occurring only once. In this work, several of the frequently occurring PTs are considered. Narratives can be labeled with more than one PT; however, in this research each PT is treated as an individual, binary classification problem.

Narratives are unstructured and highly variable. The length of narratives can vary considerably, with the shortest containing a single word and the longest containing more than 2000. Narratives are typically informal descriptions of the event and can contain many contextual clues about the nature of the event that do not necessarily use official or predictable vocabulary.

2.2 Structured Drug Product Label (SPL) Data Set

The SPL data set included 100 product labels manually annotated with MedDRA PTs by FDA personnel. The data set was originally provided as part of the FDA Adverse Drug Event Evaluation challenge conducted during 2018–2019, through which researchers worked to develop a tool for automated extraction of AEs from SPLs. Details on the data sets are available in [32].

Each SPL was provided in an XML format, from which relevant text was extracted. Each label is organized into several standardized sections. These sections include Boxed Warnings, Warnings, General Precautions, Warnings and Precautions, and Adverse Reactions [32]. This work focuses exclusively on the Adverse Reactions sections of the labels. This section is used as the primary basis for manual PT annotations. The Adverse Reactions section should contain all adverse reactions that have been attributed to a medication, while other sections of the label may provide additional information regarding the severity of certain reactions, as well as recommendations on how to monitor or treat patients who experience certain adverse reactions [32].

Table 1 contains a summary of the data sets used in this work.

Table 1 Summary of data sets

2.3 Model Choice

RNNs are an adaptation of standard neural networks designed to handle sequential data of arbitrary length [33]. This is particularly useful for making predictions from unstandardized text, which can vary widely in length. An RNN is composed of a series of ‘cells,’ where sequential data (e.g., text) is used as input. A series of algebraic operations is applied to the data inside the cell, and the resulting information is passed to the next cell. Within these operations, the RNN cell applies learned weight parameters to a concatenated vector containing the cell input and a ‘hidden state.’ In the case of text classification, a cell input corresponds to a word in a sequence, with the total number of cells equal to the length of the sequence. The hidden state is a vector of user-specified length that is passed from cell to cell and provides the model with ‘memory’ over the sequence. Learned parameters are fit to the data set by comparing the output of the final cell (the prediction) to the true value or classification associated with each datapoint. The function that computes the difference between the model output and the ground truth is called the loss function. The loss function is typically computed and summed for all datapoints in the training data set, resulting in the total loss. To fit the model, the loss function is minimized, typically using a gradient-based optimization technique in which the parameters within each cell are varied to achieve a minimum loss.
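
As an illustrative sketch (the notation is generic and not drawn from the specific model configuration used in this work), the update performed in a simple RNN cell, the final prediction, and the total training loss can be written as

$$h_t = \tanh\left(W\,[h_{t-1};\,x_t] + b\right), \qquad \hat{y} = \sigma\left(W_o\,h_T + b_o\right), \qquad \mathcal{L} = \sum_{i=1}^{N} \ell\left(\hat{y}^{(i)},\, y^{(i)}\right),$$

where $x_t$ is the embedding of the $t$-th word, $h_t$ is the hidden state, $T$ is the sequence length, $\sigma$ is the sigmoid output activation for binary classification, and $\ell$ is the per-datapoint loss.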

Simple RNNs, however, are susceptible to difficulties when training model parameters. This is referred to as the vanishing gradient problem, whereby gradients approach zero during optimization, preventing parameters from updating; it is particularly problematic for large networks [34]. Long Short-Term Memory (LSTM) networks were developed to address this limitation by including mechanisms to prevent vanishing gradients. The core concept behind LSTM is the inclusion of a memory cell that maintains information over long periods of time (elements in a sequence) and nonlinear gating units that regulate the flow of information in and out of the memory cell [35]. Gated Recurrent Unit (GRU) [36] (pre-print article) networks are a more recent development that retain the memory cell and gating concept of LSTM but reduce the number of required gating units, and therefore the number of parameters. GRU has performed similarly to LSTM, but with reduced computational burden [37]. Because this work was to serve as a proof-of-concept, GRUs were used, providing the benefits of the LSTM model while reducing the computational resources and time required.
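
As a rough, hedged illustration of this parameter reduction (the 300-dimension input and 64-unit layer width below are arbitrary example values, not the configuration used in this work), comparable LSTM and GRU layers can be instantiated in Keras and their parameter counts compared:

```python
# Sketch: compare parameter counts of an LSTM layer and a GRU layer of equal width.
# The GRU uses three gating/update transformations versus the LSTM's four,
# so it carries roughly three-quarters of the parameters.
from tensorflow.keras import layers

embedding_dim, units = 300, 64  # illustrative sizes only

lstm = layers.LSTM(units)
lstm.build((None, None, embedding_dim))  # (batch, timesteps, features)
print("LSTM parameters:", lstm.count_params())

gru = layers.GRU(units)
gru.build((None, None, embedding_dim))
print("GRU parameters: ", gru.count_params())
```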

2.4 Preprocessing Tasks

Figure 1 displays a high-level diagram of the preprocessing tasks detailed in Sect. 2.4 (left side) and the experimental validation procedures detailed in Sect. 2.5 (right side).

Fig. 1
figure 1

Diagrammatic representation of natural language pre-processing tasks applied to text data (Sect. 2.4) and the model cross-validation procedures (Sect. 2.5) used to estimate model performance. FAERS FDA Adverse Event Reporting System, SPL structured drug product label

It is typical to perform several standard natural language preprocessing (NLP) tasks to format text for use in machine learning models. First, text data were tokenized, a process in which bodies of text are split into individual words (tokens). Next, unwanted text was filtered: punctuation, non-alphabetic words, and stop words were removed. Stop words are common words that provide little information in prediction tasks, such as ‘what,’ ‘where,’ ‘is,’ ‘are,’ ‘a,’ and ‘the.’ Tokens were then lemmatized, which transforms words into their base morphological form; for example, ‘mice’ becomes ‘mouse’ and ‘eating’ becomes ‘eat.’ All natural language processing tasks were performed using the Natural Language Toolkit (NLTK) in the Python programming language (https://www.nltk.org/).
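
A minimal sketch of these steps using NLTK is shown below; the specific tokenizer, stop-word list, and lemmatizer settings are illustrative assumptions rather than the exact configuration used in this work.

```python
# Illustrative NLTK preprocessing: tokenize, filter unwanted tokens, lemmatize.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # split text into tokens
    tokens = [t for t in tokens if t.isalpha()]           # drop punctuation and non-alphabetic tokens
    tokens = [t for t in tokens if t not in stop_words]   # drop stop words
    return [lemmatizer.lemmatize(t) for t in tokens]      # reduce tokens to a base form

print(preprocess("The patient was experiencing seizures and severe hypotonia."))
# ['patient', 'experiencing', 'seizure', 'severe', 'hypotonia']
```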

For use in an RNN, data must be translated into a sequence format. In sequence format, each token in each data point becomes a unique feature or column entry. Rows are padded with zeros to the length of the longest sequence. Keras, a Python library that provides the building blocks for developing deep learning models, offers many features and resources to facilitate model building [38]. Keras was used to format data into sequence format for RNN model input and was also used for model definition and training in the steps outlined below.
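
A brief sketch of this conversion using the Keras text utilities follows; the toy documents and default settings are illustrative only.

```python
# Illustrative conversion of preprocessed token lists into padded integer sequences.
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

docs = [["patient", "experienced", "seizure"],
        ["drug", "ineffective", "despite", "increased", "dose"]]

tokenizer = Tokenizer()              # maps each unique token to an integer index
tokenizer.fit_on_texts(docs)
sequences = tokenizer.texts_to_sequences(docs)

# Pad rows with zeros to the length of the longest sequence so all rows are equal length.
X = pad_sequences(sequences, padding="post")
print(X)  # shape (2, 5); the first row ends in two zeros of padding
```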

To train a statistical model, text must be translated into a numerical format. One of the simplest models for creating real-valued vectors from text is bag-of-words. This model characterizes documents in a textual data set as vectors, where each entry typically corresponds to the frequency of a unique word from the vocabulary of the entire data set. Words are also typically assigned a weight based on their frequency across documents [39]. The limitation of this and similar approaches is that information conveyed by word order is lost. Additionally, there is no way to compare lexically similar words. For example, comparing the words ‘cat’ and ‘lion’ will result in the same value as comparing ‘cat’ and ‘automobile’, even though the former pair is clearly more related. Such a model also cannot handle words not encountered in the training data [40].
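
For illustration only, a bag-of-words representation can be sketched with scikit-learn; this is shown to clarify the concept and its limitations, not because bag-of-words features were used in this work.

```python
# Illustrative bag-of-words: each document becomes a vector of term counts,
# optionally re-weighted by document frequency (TF-IDF). Word order is discarded.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["patient experienced seizure",
        "drug ineffective seizure recurred"]

counts = CountVectorizer().fit_transform(docs)        # raw term frequencies
weighted = TfidfTransformer().fit_transform(counts)   # frequency-based weighting across documents
print(counts.toarray())
print(weighted.toarray().round(2))
```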

Word embeddings are vector representations of words generated with the goal of representing linguistic similarity mathematically, and are commonly used in text classifications [41]. Word embeddings allow algebraic operations to be performed on words such that linguistic meaning is preserved. Most word embeddings are derived from massive corpora using unsupervised or semi-supervised machine learning and dimensionality reduction techniques. Word2vec [42] (pre-print article) and GloVe [31] are popular algorithms that have seen significant use in recent years for NLP tasks.

In this work, GloVe pretrained word embeddings were utilized. Past researchers have used GloVe successfully for similar AE detection tasks [28]. GloVe uses a log-bilinear regression model to fit weight vectors to words based on the probability of word–word co-occurrence in a large text corpus. The resulting word vectors, or word embeddings, exhibit contextual information in relation to one another, in essence quantifying the relatedness of words [31]. GloVe can be used to fit new custom word embeddings given a large corpus. GloVe also has pretrained embeddings that can be repurposed for new tasks. In this work, a GloVe word embedding trained on Common Crawl (https://commoncrawl.org/) data was used. The embedding was trained on 42 billion tokens, has a vocabulary of 1.9 million words, and contains 300-dimensional word vectors. The GloVe word embedding covered 92.9% of the words in the FAERS data set, 82.7% of the words in the SPL data sets combined, and 93.8% of the words in the annotated SPL data set.
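
A minimal sketch of loading the pretrained vectors and checking vocabulary coverage is shown below; the file name corresponds to the public Common Crawl release of GloVe, and the coverage calculation is an assumed, simplified version of this step.

```python
# Load pretrained GloVe vectors into a dict and compute vocabulary coverage.
import numpy as np

embeddings = {}
with open("glove.42B.300d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        embeddings[word] = np.asarray(values, dtype="float32")  # 300-dimensional vector

# 'vocabulary' would be the set of unique tokens in the preprocessed data set.
vocabulary = {"seizure", "hypotonia", "valproate", "notarealword"}  # illustrative only
covered = sum(1 for w in vocabulary if w in embeddings)
print(f"Embedding coverage: {covered / len(vocabulary):.1%}")
```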

2.5 Model Experiments

The first test was to develop an RNN model to classify FAERS patient narrative entries by MedDRA PTs. A 5-layer deep GRU network was defined using Keras in Python 3.6. To evaluate model performance, several of the highest-prevalence PTs were selected for validation. The PTs included, their frequencies, and the lengths of the processed text in the data set are shown in Table 2.

Table 2 Preferred terms included in the FAERS data set experiments

K-fold cross-validation is a technique to estimate the performance of a model on new data and is performed on labeled training data. This technique splits the data into k ‘folds,’ trains the model on k-1 folds, and tests the model on the remaining fold. This is repeated until each fold has been left out and tested on once. The results are then averaged. The folding procedure is often repeated several times to get unique combinations of datapoints, usually noted as r × k-fold cross-validation, where r is the number of repeats.
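
The splitting scheme can be sketched with scikit-learn's RepeatedStratifiedKFold; the placeholder data and logistic regression model below are purely illustrative.

```python
# Illustrative 5 x 5-fold stratified cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.random.rand(100, 10)            # placeholder feature matrix
y = np.random.randint(0, 2, size=100)  # placeholder binary PT labels

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx]), zero_division=0))

print(f"Median F1 across all folds: {np.median(scores):.2f}")
```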

We performed 5 × 5-fold stratified cross-validation for ‘drug interaction’ and ‘drug ineffective’; 4-fold stratified cross-validation was used for the other, lower-frequency PTs to ensure several positive cases were present in the validation set. To manage the relatively imbalanced nature of the data set, minority classes were randomly oversampled to the size of the majority class. Initial tests indicated that overfitting may be an issue, so dropout was incorporated into the model [43]. Machine learning models are trained by minimizing some loss function that compares model-predicted labels with the true labels. Optimization of model parameters depends on the parameters of the optimization routine, referred to as hyperparameters [44]. Hyperparameter tuning was performed to improve model training performance. Model parameters are shown in Table 3. Binary cross-entropy was used as the loss function.
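
The architecture and hyperparameters used are listed in Table 3; the sketch below only illustrates the general pattern of random minority-class oversampling followed by a stacked GRU classifier with dropout and a binary cross-entropy loss, with vocabulary size, layer widths, and dropout rates chosen arbitrarily for the example.

```python
# Illustrative pattern: oversample the minority class, then define a stacked
# GRU binary classifier with dropout, compiled with binary cross-entropy.
import numpy as np
from tensorflow.keras import layers, models

def oversample(X, y, seed=0):
    # Randomly resample the minority class up to the size of the majority class.
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    resampled = rng.choice(minority, size=len(majority), replace=True)
    idx = np.concatenate([majority, resampled])
    return X[idx], y[idx]

vocab_size, embedding_dim = 20000, 300  # illustrative values only

model = models.Sequential([
    layers.Embedding(vocab_size, embedding_dim),  # pretrained GloVe weights would be loaded here
    layers.GRU(64, return_sequences=True, dropout=0.2),
    layers.GRU(64, return_sequences=True, dropout=0.2),
    layers.GRU(64, return_sequences=True, dropout=0.2),
    layers.GRU(64, return_sequences=True, dropout=0.2),
    layers.GRU(64, dropout=0.2),
    layers.Dense(1, activation="sigmoid"),        # binary PT prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```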

Table 3 RNN model parameters

The focus of this work was to assess the given data sets as candidates for modeling with sequence-based neural network architectures, not necessarily to optimize the model for the data. Therefore, the choice of five hidden layers was arbitrary, given the proof-of-concept nature of this work. Further, while a bidirectional architecture was considered for the model, a standard unidirectional RNN was chosen to establish a simplest-case benchmark for future improvement.

To demonstrate a predictive classifier, it is common to benchmark the results against other conventional classifiers [28, 29]. Two simpler classifiers, logistic regression and an SVM with a radial basis function kernel, were validated in the same manner and used for comparison. In both cases, the average word embedding for each narrative was used as input.
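
A hedged sketch of these baselines, in which each narrative is represented by the mean of its GloVe word vectors, is shown below; the regularization and kernel settings are assumptions, not the exact configuration used.

```python
# Illustrative baselines: mean word embedding per narrative as features,
# classified with logistic regression and an RBF-kernel SVM.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def average_embedding(tokens, embeddings, dim=300):
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def fit_baselines(token_lists, embeddings, y):
    # token_lists: preprocessed narratives; embeddings: word -> GloVe vector dict; y: binary labels.
    X = np.vstack([average_embedding(tokens, embeddings) for tokens in token_lists])
    logreg = LogisticRegression(max_iter=1000).fit(X, y)
    svm = SVC(kernel="rbf").fit(X, y)
    return logreg, svm
```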

Metrics used to evaluate performance were recall, precision, and F1-score, of which the latter is typically the preferred metric in NLP for evaluating model performance [45]. Accuracy can provide misleading results when classes are imbalanced and detection of the minority class is valued more than that of the majority, as is the case with the data sets in this paper. For model validation, training continued for 20 epochs or until validation F1 did not improve for 10 epochs. Medians and interquartile ranges for results are reported. Multiple comparisons of F1-scores between models were performed for each PT using Mood’s median test with Bonferroni correction.
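
A sketch of the evaluation metrics and the post-hoc comparison is shown below; scipy's median_test implements Mood's median test, and multiplying p values by the number of comparisons is one common form of Bonferroni correction (shown here as an assumption, not necessarily the exact procedure used).

```python
# Illustrative computation of precision, recall, F1-score, and pairwise
# Mood's median tests on per-fold F1-scores with a Bonferroni correction.
from itertools import combinations

from scipy.stats import median_test
from sklearn.metrics import precision_recall_fscore_support

def fold_metrics(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    return precision, recall, f1

def compare_f1(f1_by_model, n_comparisons):
    # f1_by_model: dict mapping model name to the list of per-fold F1-scores.
    for a, b in combinations(f1_by_model, 2):
        stat, p, _, _ = median_test(f1_by_model[a], f1_by_model[b])
        print(f"{a} vs {b}: Bonferroni-corrected p = {min(1.0, p * n_comparisons):.3f}")
```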

In the second test, the goal was to evaluate how well the same type of model (RNN) would classify SPLs based on relevant PTs. The researchers performed validation using the 100 annotated product labels. We performed 5 × 5-fold stratified cross-validation for several high-prevalence PTs, as well as for several low-prevalence PTs, to serve as a comparison to the narrative prediction performance. The included PTs, their frequencies, and the lengths of the processed text in the data set are shown in Table 4. Model performance was benchmarked against logistic regression and an SVM with a radial basis function kernel. Model results were again compared using Mood’s median test for post-hoc comparisons.

Table 4 Preferred terms included in the SPL data set experiments

3 Results

Discussed here are the proof-of-concept results for the FAERS data set and SPL data set PT classification tasks.

3.1 FAERS Patient Narrative Classification

Reported in Table 5 are the median and interquartile range (IQR) for validation F1-score, recall, and precision for the FAERS data cross-validation. Superscripts are used to signify multiple comparisons results. The RNN outperformed the other models only for ‘drug interaction,’ though not significantly so for either comparison model. The RNN performed significantly worse for ‘acute kidney injury’ and ‘seizure.’

Table 5 FAERS AE detection median (interquartile range) F1-score, recall, and precision resulting from cross-validation

Table 6 contains the results of the multiple comparisons analysis. The largest differences in performance were observed for ‘acute kidney injury,’ with the RNN performing significantly worse than the other models.

Table 6 Results of multiple comparisons analysis for FAERS narrative AE detection model F1 results using Mood’s median test with Bonferroni correction

3.2 SPL Classification

Reported in Table 7 are the median and IQR for validation F1-score, recall, and precision for the SPL training data cross-validation. The RNN performed significantly better than the other models for the low occurrence PTs and only performed significantly worse in the case of ‘diarrhea’ compared with logistic regression, but with a very small effect size.

Table 7 SPL AE detection median (interquartile range) F1-score, recall, and precision for cross-validation

Table 8 contains the results of the multiple comparisons analysis. Differences in performance were largest between the RNN and the other models for ‘arrhythmia.’

Table 8 Results of multiple comparisons analysis for SPL AE detection model F1 results using Mood’s median test with Bonferroni correction

4 Discussion

The following is a discussion of the model results and their implications as a proof-of-concept.

4.1 FAERS Patient Narrative Classification Validation

These results provide evidence that, with some additional model optimization and more data, the RNN model could assist in automated extraction of AE data from FAERS patient narratives. Overall, however, the given patient narrative data set did not support models that generalized well enough to reliably predict the selected PTs. The generally poor results can likely be attributed to a combination of small sample size, the short length of some narratives, and the unstructured nature of the text. Oversampling of minority classes did aid in overcoming the imbalance issues; however, the poor F1-scores suggest that positive cases may have been too few, and the text too variable, to generalize effectively, and thus the models were prone to overfitting (particularly the RNN). For ‘drug ineffective,’ ‘acute kidney injury,’ and ‘seizure,’ some narrative contents were extremely short; in one case, for ‘seizure,’ the text was a single word after processing. It is very unlikely that these extremely short narratives were correctly predicted in any of the cases.

For the patient narrative classification task, results were mixed between models. Results indicated that for ‘drug interaction’ the RNN had the best performance. For the other terms, the simpler models performed better. The difference in performance may be due to the nature of the terms themselves. ‘Acute kidney injury’ and ‘seizure’ point to specific ailments with distinct terminology that may have resulted in a relatively linear influence on the average word embedding. ‘Drug interaction’ and ‘drug ineffective’ describe more abstract concepts and may require a more complex model to recognize their linguistic signal. That said, the RNN did not perform significantly better than the other models for any term so it is difficult to infer the generalizability of this type of data without a larger sample.

4.2 Structured Product Label Classification Validation and Testing

Overall model performance was significantly better for SPL validation, likely due to the structured nature of the product label text and the more balanced class distribution. The RNN model performed at least as well as the simpler models for all terms except ‘diarrhea’ and performed significantly better than both models for ‘malaise,’ ‘flushing,’ ‘atrial fibrillation,’ and ‘arrhythmia.’ That the deep learning model performed similarly to the simpler models for the high-frequency terms and better for the low-frequency terms suggests that the ability to detect complex linguistic relationships can help overcome class imbalance issues.

The discrepancy between the RNN’s ability to extract low frequency terms from the FAERS data versus the SPL data is not entirely understood. It may be that the structured and comprehensive nature of product labels contains information that the patient narratives do not. The RNN may be able to extract more predictive value from word order from the SPLs than the patient narratives due to the structured nature. Additionally, SPLs contain comprehensive information regarding possible drug adverse reactions. Patient narratives generally do not contain as many references to medical terminology that may help a model develop associations. RNNs are well suited to identify this type of complex co-occurring medical terminology associated with certain AEs.

4.3 Limitations and Implications

This work suggests that deep learning models, specifically RNNs, can be used to extract AEs as preferred terminology from SPLs at least as well as, and in some cases better than, other standard predictive models. This work was limited in that validation results only demonstrated model effectiveness for a small subset of PTs. Future work should focus on optimizing the current model more rigorously, as well as exploring additional model architectures. A model that produced consistent results across PTs would increase the usefulness of SPLs for discovering drug–AE associations.

Another limitation of this work occurred with respect to data preprocessing. Non-alphabetic words were removed from text prior to fitting models. In hindsight, this may have removed valuable information from the SPL data set, as some PT terminology relies on mixed alpha-numeric terminology (e.g., HLA-B*1502-positive). This was less of a concern for the FAERS data set, which typically contained less technical language. Future work should verify the influence of this terminology in adverse drug event text data for similar prediction tasks.

In this work, each PT prediction task was treated as an individual, binary classification problem, and a separate model was fit in each case. While the prediction tasks could have been integrated into a single architecture as multiple binary classification problems, oversampling the data for minority classes is more complicated in that case. This ultimately should not significantly influence the results; however, if the goal were to classify a much larger set of PTs, the approach used here would be time-prohibitive.

While model performance was poorer for the patient narrative data, these data should not be ruled out as a candidate for prediction using this type of model. The maximum number of samples for a single class was only 14, so it is not overly surprising that the models did not generalize well. Manual annotation of patient narratives is an arduous task that requires specialized knowledge of MedDRA. Future work should focus on streamlining the acquisition of additional data and evaluating new word embedding techniques as the field advances. Bidirectional Encoder Representations from Transformers (BERT), for example, is a technique that uses contextual information bidirectionally in the construction of word embeddings [46] (pre-print article). This allows BERT to express multiple meanings for words that appear identical, based on context. BioBERT, an embedding model trained on a large corpus of biomedical text, could be especially useful in the pharmacovigilance domain [47].

5 Conclusions

Automated extraction of AEs using standardized terminology could aid in streamlining regulatory processes and discovering new drug–AE associations. Extracting events in real time from post-market patient narratives would be especially useful for detecting new safety issues and protecting public health. While model performance was mixed, especially for underrepresented PTs in the FAERS data set, this work provides evidence that well-represented terms can reliably be determined using an RNN. The evaluations of unannotated SPL predictions provided further support for this finding. Machine learning has the potential to increase the efficiency of discovering safety issues associated with pre- and post-market drugs from textual data. A concerted effort should be made to increase the amount of available annotated data such that these models can continue to be developed and optimized.