FormalPara Key Points

The MADE (Medication and Adverse Drug Events from Electronic Health Records) 1.0 corpus comprises 1089 electronic health records with detailed named entity and relation annotations.

We provide benchmark results for, and an analysis of, the MADE 1.0 corpus using system submissions to the MADE 1.0 challenge.

MADE 1.0 results suggest that machine learning systems can be useful for automated extraction of adverse drug events and related entities from unstructured texts but that room for improvement remains.

1 Introduction

An adverse drug event (ADE) is “an injury resulting from a medical intervention related to a drug” [1]. ADEs are the single largest contributor to hospital-related complications in inpatient settings [2] and account for approximately one-third of all hospital adverse events (AEs). They affect more than 2 million hospital stays annually [3] and prolong hospital length of stay by 1.7–4.6 days [4, 5]. These events also account for approximately two-thirds of all post-discharge complications, more than one-quarter of which are estimated to be preventable [6]. National estimates suggest that ADEs contribute at least an additional $US30 billion to US healthcare costs [7].

Ideally, likely ADEs would be detected in randomized controlled trials (RCTs) before the relevant drug ever enters the market. However, the limited number of participants and inclusion/exclusion criteria that restrict subject characteristics (demographics, medical condition and diagnosis, age) [8] mean that pre-marketing RCTs frequently miss ADEs. This assertion is supported by the fact that the rate at which the US FDA withdraws previously approved drugs within the first 16 years after approval ranges from 21 to 27% [9]. Drug safety surveillance and post-marketing pharmacovigilance, “the science and activities relating to the detection, assessment, understanding, and prevention of adverse effects or any other drug-related problem” [10], are therefore vitally important tools for monitoring the safety of FDA-approved drugs.

One of the earliest approaches to post-marketing pharmacovigilance aimed at improving drug safety is the spontaneous reporting system (SRS), such as the voluntary FDA Adverse Event Reporting System (FAERS). Although SRSs have been highly successful for pharmacovigilance, they have limitations, such as under-reporting [11, 12] and missing important patterns of drug exposure [13]. To counter these shortcomings, other resources have been proposed for pharmacovigilance, including the biomedical literature [14] and social media [15]. However, the biomedical literature has been shown to identify only a limited set of ADEs, mainly rare ones [16]. Social media also presents challenges, such as incomplete and erroneous drug exposure patterns and duplication [17].

It is well known that electronic health records (EHRs) contain rich ADE information, and they have been widely used for drug safety surveillance and pharmacovigilance [2, 6]. Unlike other resources, which are passive in nature, EHRs can be a rich resource for real-time or active pharmacovigilance and patient–drug surveillance. In addition, they can lead to better and more cost-effective patient management [18]. In 2009, the FDA initiated the Mini-Sentinel program to facilitate the use of routinely collected EHR data for active surveillance of the safety of marketed medical products [19]. For example, Yih et al. [20] showed an increased risk of intussusception after rotavirus vaccination in US infants.

However, most EHR-based pharmacovigilance and patient safety surveillance systems rely on analyses of structured data such as International Statistical Classification of Diseases and Related Health Problems (ICD) codes [21, 22]. It is well known that ADEs are often buried in the clinical narrative [23,24,25] and are not separately recorded in diagnosis codes or other structured data fields. Even information that is expected to be reported in structured fields, such as bodyweight, frequently appears only in EHR text [26]. In addition, information needed to assess the causal link between a medication and an ADE, including temporal and causal relations, often exists only in the narrative. Extraction of ADE information from EHR narratives nevertheless remains a challenge because manual data extraction is very costly, which is a significant impediment to large-scale pharmacovigilance studies.

Natural language processing (NLP) may be a solution that provides fast, accurate, and automated ADE detection, yielding significant cost and logistical advantages over the aforementioned practices of manual chart review or voluntary reporting [27]. However, despite advances in NLP, few methods have been specifically developed for detecting ADE information in EHR notes. Several NLP systems use resources such as the Unified Medical Language System (UMLS) [28] to extract disease and drug mentions and then generate ADE predictions using co-occurrence metrics. Such approaches may miss other important drug information (e.g., indication and dose) and fail to capture temporal and/or causal associations between a drug and an ADE that may be explicitly expressed in EHR narratives. Moreover, different NLP systems have been evaluated on different gold standards, making it challenging to identify state-of-the-art NLP technologies.

Therefore, we created the MADE (Medication and Adverse Drug Events from Electronic Health Records) 1.0 corpus, a publicly available, expert-curated benchmark of EHR notes annotated with clinical named entities (i.e., drug name, dosage, route, duration, frequency, indication, ADE, and other signs and symptoms) and relations (ADE–drugname, indication–drugname, drugname–attributes, etc.). The MADE corpus is the first dataset that provides detailed annotations of medications, indications, ADEs, and their attributes and relations relevant to drug safety surveillance and pharmacovigilance studies.

Using this high-quality corpus as a benchmark, we designed three shared tasks (named entity recognition (NER), relation identification (RI), and NER-RI) to assess the state-of-the-art NLP technologies that have the potential to improve downstream pharmacovigilance-related tasks. These shared tasks were organized in the First Natural Language Processing Challenge for Detecting MADE hosted by the University of Massachusetts (Amherst, Lowell, and Worcester, USA) from August 2017 to March 2018. In this paper, we first describe the MADE corpus, then document the shared tasks and provide a comprehensive report of system submissions in the MADE challenge. The main contributions of this paper are as follows.

  • Present the first richly annotated and publicly available EHR data for ADE detection and drug surveillance research.

  • Describe the carefully designed annotation schema and release the detailed annotation guidelines, a valuable resource not only for drug safety research but also for other data-driven clinical informatics research.

  • Introduce three shared tasks in the MADE challenge and report the system submissions and results.

  • Perform an ensemble-based aggregation of systems, showing that the top systems are complementary and can be integrated to push the boundaries of extracting medications, indications, and ADEs from EHRs.

2 Related Work

Natural language processing techniques have been widely applied in biomedicine [28,29,30,31,32,33]. Much of the NLP research in the biomedical domain has centered on NER and normalization tasks. Examples of shared tasks in this domain include BioNLP [34], BioCreAtIvE [35], the i2b2 shared NLP tasks [36], and the ShARe/CLEF evaluation tasks [37].

Existing NLP approaches for EHR ADE detection can be grouped into rule-based, lexicon-based, supervised machine learning, and hybrid approaches. For example, Li et al. [38] built an NLP system with the knowledge of a domain expert. Melton and Hripcsak [39] applied the NLP system MedLEE, a rule-based semantic parser, to detect concepts. Similarly, Humphreys et al. [40] applied MedLEE to map free text to UMLS concepts and semantic types. UMLS [41] is a resource that combines multiple biomedical and clinical resources into a unified ontology. Rochefort et al. [42] developed supervised machine learning classifiers that use bag-of-words features from EHR narratives to classify whether a clinical note documents deep venous thromboembolism (DVT) or pulmonary embolism (PE). Haerian et al. [43] applied distant supervision to identify terms (e.g., suicidal, self-harm, and diphenhydramine overdose) associated with an assigned suicide ICD-9 code and then used those terms to recover suicide events. Wang et al. [44] used MetaMap to identify drugs mentioned in the text threads of online health forums. Nikfarjam et al. [45] annotated ADE information in user posts from DailyStrength and Twitter and then used word-embedding models and conditional random fields (CRFs) for prediction. Li et al. [46] developed NLP methods to extract medication information (e.g., drug name, indication, contraindication) and adverse events from FDA drug labels. Duke and Friedlin [47] applied MetaMap to identify ADEs from structured product labels.

Related work on corpora for biomedical and clinical NLP research includes the GENIA corpus [48] and the TREC Genomics track [49]. Shared tasks such as BioNLP [34] and BioCreAtIvE [35] have been widely used to train and evaluate NLP applications. Other annotated corpora include the disease corpus [50], the BioScope corpus [51], and the MEDLINE abstract corpus from Gurulingappa et al. [52]. The corpus closest to ours is the i2b2 2009 corpus by Uzuner et al. [53], which provides annotations for medications and related named entities. However, our work extends the annotation schema used in the i2b2 corpus and provides a common dataset for medications and ADEs. Another similar work is that by Henriksson et al. [54], who annotated a dataset focused on ADE extraction. In contrast to their work, the MADE corpus also provides annotations for medication details such as dosage and frequency, which are highly relevant for pharmacovigilance studies.

3 The MADE Corpus

The MADE corpus comprises 1089 fully de-identified longitudinal EHR notes from 21 randomly selected patients with cancer at the University of Massachusetts Memorial Medical Center. Because the notes are longitudinal, they include diverse note types such as discharge summaries, consultation reports, and other clinic notes (Table 1).

Table 1 The overall statistics for the MADE corpus

We used an iterative process throughout the annotation, going back and forth between annotating documents and refining the annotation guidelines. Through this process, we created a comprehensive annotation guideline that addresses how to handle language variations and ambiguities in clinical narratives relevant to this annotation task. The guideline adapts and substantially extends the annotation guideline of the 2009 i2b2 Medication Challenge shared task [53]. The MADE annotation guideline is designed with a focus on extracting ADEs and other relevant clinical information. It defines nine named entity types and seven relation types; the relation types define relationships between pairs of annotated named entities. A succinct overview of the annotation categories is provided in the following subsections; the entity and relation types are described in detail in the annotation guideline.

3.1 Named Entity Types

The named entity types can be broadly defined as either events or attributes. Events are annotations that denote a change in a patient’s medical status, including the prescription of a medication and the identification of a symptom or diagnosis. Events have attributes, including severity and information related to medications (e.g., dosage). The occurrence counts of each named entity type are provided in Table 2. As evident from the table, there is a large label imbalance in the data, which makes developing an NLP system to detect the rarer entity types challenging.

Table 2 Annotation counts and word counts for each named entity type

Based on their context, the named entity annotations can be clustered into those related to sign, symptom, or disease (SSD) mentions and those related to medication (drugname) mentions. The two categories are described in the following subsections.

3.1.1 Sign, Symptom, or Disease (SSD)

Annotations in the SSD group define events and properties relevant to SSD mentions. The relevant named entity types are ADEs, indication, other SSD, and severity. ADEs, indication, and other SSDs are event annotations.

ADE ADEs are a type of SSD. They are adverse events caused by a drugname. An ADE annotation requires a direct linguistic cue that links the adverse effect to a drugname, e.g., “Patient had anaphylaxis after getting penicillin.”

Indication An indication is annotated if it is explicitly linked to a medication, e.g., “The patient was troubled with mouth sores and is being treated with Actiq.”

Other SSD Any SSD event that is not annotated as an indication or ADE is categorized as “other SSD”. In our EHRs, other SSDs frequently occur in the history section of notes, e.g., “headache in the back of the head.”

Severity These annotations are attributes of SSDs that indicate the severity (e.g., acute, mild, severe) of a particular SSD.

3.1.2 Medication (Drugname)

Medication includes drugname and its attributes.

Drugname The drugname annotation includes descriptions that denote any medication, procedure, or therapy prescription, e.g., warfarin, propofol, chemotherapy, etc.

Duration Duration is the time range for the administration of the drugname, as explicitly described in the notes, e.g., 2 weeks, 15 h.

Dosage Dosage is the amount of drug in a unit dose. It is a numerical value and an attribute of the drugname entity, e.g., two tablets, 4 ml/h.

Frequency Frequency is the rate of administration of the drug and is an attribute of drugname, e.g., every hour, three times daily (t.i.d.), four times daily (q.i.d.).

Route Route is the path through which a drug is taken into the body. It is an attribute of the drugname entity, e.g., orally, central line.

3.2 Relation Types

Table 3 shows the seven relation types and their frequencies in the MADE corpus. A relation type is defined as a relation between two different named entity types. A brief description of each relation, along with the relevant named entities, is provided in the following.

Table 3 Annotation counts and relation length for each relation category

Drugname Attribute Relations

The MADE corpus contains four different relation types that describe a relation between the drugname entity and its various attributes:

  • Drugname–dosage

  • Drugname–route

  • Drugname–frequency

  • Drugname–duration.

The attributes (dosage, route, frequency, duration) are properties of the drugname entity.

SSD–severity Severity is an attribute of an SSD (ADE, indication, other SSD). It is typically a modifier (e.g., mild) for an annotated entity (e.g., fever).

ADE–drugname ADE is an adverse effect of the prescription of the drugname entity.

Indication–drugname The drugname entity has been prescribed as a direct treatment for the indication entity.

In the MADE corpus, relations between named entities can occur within a sentence or across multiple sentences in a note. Table 3 provides the relation lengths in characters. The ADE–drugname and indication–drugname relations have long-tailed length distributions, indicating that, in several instances, they connect named entities that are several sentences apart. We discuss the implications of this trend in Sect. 4.3.

3.3 Annotators

The annotation process involved multiple annotators, including physicians, biologists, linguists, and biomedical database curators. Annotators participated both in document annotation and in the development of the annotation guidelines. The following process was used to annotate each file. A first annotator individually labeled the spans and types of named entities and relations. A second annotator then reviewed the annotations and modified them to produce the final version. This process was used to reduce the annotation cost per document while ensuring high annotation quality.

Since the annotations produced by the two annotators in this process were not independent, they could not be used to estimate inter-annotator agreement (IAA). To obtain a fair IAA estimate, we performed a smaller study in which five annotators independently annotated three documents from our corpus. We used Fleiss’ kappa (κ) [55] as the measure of IAA. The κ values for the named entity and relation annotations were 0.628 and 0.424, respectively. The relation κ measures agreement on both the named entity and the relation annotation; this added complexity may explain its comparatively lower agreement value. Both values fall in the moderate-to-substantial agreement range [56], suggesting that our annotations are reliable for evaluating information extraction systems.
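For readers who want to reproduce this kind of agreement estimate, the following is a minimal sketch using the Fleiss’ kappa implementation in statsmodels; the toy ratings matrix is purely illustrative and is not our annotation data.

```python
# Minimal sketch of a Fleiss' kappa computation with statsmodels
# (illustrative toy data only; not the MADE annotation data).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = annotation items, columns = the five independent annotators,
# values = the category index each annotator assigned
ratings = np.array([
    [0, 0, 0, 1, 0],
    [2, 2, 2, 2, 2],
    [1, 1, 0, 1, 1],
])

table, _ = aggregate_raters(ratings)      # item x category count matrix
print(fleiss_kappa(table, method="fleiss"))
```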

3.4 De-identification

The EHR data were de-identified using the Safe Harbor method defined by the US Department of Health and Human Services in 45 CFR 164.514(b)(2). First, the EHR data were processed with a publicly available de-identifier [57], which automatically annotated the 18 types of Safe Harbor identifiers. Second, each clinical note was manually reviewed during the annotation process to ensure that all identifiers were marked fully and correctly. All marked identifiers were removed before the data were released to the teams participating in the MADE challenge.

3.5 Evaluation Script

To standardize the evaluation of NER and RI, we developed an evaluation script. The script uses the bioC format, a simple format developed by the research community for sharing text data and annotations.
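As an illustration of the bioC layout, the sketch below reads named entity annotations from a bioC XML file using only the Python standard library. The file name and the “type” infon key are assumptions made for illustration and may differ from the exact keys used in the released MADE files.

```python
# Illustrative bioC reader (not the official MADE tooling): prints the type,
# text, and character span of each <annotation> element in a bioC XML file.
import xml.etree.ElementTree as ET

tree = ET.parse("note_0001.bioc.xml")                    # hypothetical file name
for annotation in tree.getroot().iter("annotation"):
    ent_type = annotation.findtext("infon[@key='type']") # assumed infon key
    location = annotation.find("location")
    span = (int(location.get("offset")), int(location.get("length")))
    print(ent_type, annotation.findtext("text"), span)
```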

We used exact phrase-based evaluation, i.e., a named entity is correct only when the predicted span and entity type exactly match the reference annotation. This is important because a partial match (e.g., “infarction”) may be semantically different from the exact phrase (e.g., “myocardial infarction”). For relations, a predicted relation between two entities is regarded as correct only if both the relation type and all of the constituent named entities are predicted correctly. We used the F1 score for system evaluation because it combines precision and recall, both of which are important metrics for information extraction systems. A micro-averaged F1 score was used to obtain an aggregate score over all classes, strictly following the micro-average implementation used by scikit-learn. We also report micro-averaged precision and recall scores for the systems in the interest of interpretability.
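The sketch below illustrates the micro-averaging logic at the level of entity predictions: true positives, false positives, and false negatives are pooled across all classes before precision, recall, and F1 are computed. It mirrors the definition we follow but is not the official evaluation script.

```python
# Micro-averaged precision, recall, and F1 over entity predictions represented
# as hashable tuples, e.g., (doc_id, start, end, entity_type).
def micro_prf(gold: set, predicted: set):
    tp = len(gold & predicted)                 # exact matches
    fp = len(predicted - gold)                 # spurious predictions
    fn = len(gold - predicted)                 # missed gold annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```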

Our evaluation script also provides an approximate metric that uses word-based evaluation, i.e., a named entity is considered correct if one or more of its words match. However, approximate matching was not used for evaluation in the MADE challenge. During the challenge, we found instances in which a trailing period was inconsistently included in or excluded from a named entity span. Our evaluation script therefore ignores span errors of one trailing character to account for such inconsistencies. Details regarding the annotation inconsistencies can be found in Sect. 5.
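A hedged sketch of a span comparison that tolerates a one-character difference at the end of a span is shown below; the official script’s exact handling may differ in minor details.

```python
# Treat two spans as matching if they start at the same offset and their end
# offsets differ by at most one trailing character (e.g., a trailing period).
def spans_match(gold_span, pred_span, tolerance=1):
    g_start, g_end = gold_span
    p_start, p_end = pred_span
    return g_start == p_start and abs(g_end - p_end) <= tolerance
```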

3.6 Test and Train Data

In total, 213 notes from the MADE corpus were selected as the test split, and the remaining 876 notes formed the training split of the MADE challenge. To minimize the potential for over-fitting and maximize evaluation quality, we used two approaches to select the test set. We first selected three of the 21 patients in the MADE cohort and included all of their EHR notes (a total of 153) in the test set. We then selected between zero and four notes from each of the remaining 18 patients, adding a further 60 notes to the test set.

4 The MADE Challenge

The MADE challenge invited participants to submit systems for three shared tasks. The MADE corpus training data were released in November 2017, 4 months before the final test run. Submissions were evaluated using the criteria described in Sect. 3.5. We designed two different runs: standard and extended. Submissions in the standard run were limited in the type of external tools they could use, which provided a fair evaluation of the NLP models. For this run, teams could freely use general-purpose open-source NLP tools, such as Stanford NLP, the Natural Language Toolkit (NLTK), and the UMLS tools, for feature engineering. However, they were not allowed to use custom clinical NLP software, other EHR datasets, NLP tools trained on other EHR datasets, or proprietary or in-house NLP software. Systems not adhering to these constraints were categorized as extended runs. We allowed two submissions for each run from each participating team but considered only standard runs in the final evaluation. Please refer to the relevant articles for evaluations and analyses of the extended runs.

The three shared tasks were designed to evaluate submissions with an overall goal of identifying ADEs and other relevant entities and relations from EHR notes.

Task 1: Named Entity Recognition (NER)

This task required extraction of both named entity spans and their types from EHR notes. The named entity types are described in Sect. 3.1. The input was an unlabeled raw EHR note, and the output was a bioC file containing the entity spans and types. The evaluation for this task used the exact phrase-match F1 score described in Sect. 3.5.

Task 2: Relation Identification (RI)

This task required classification of the relation, and its type, between two provided named entities. Since the named entities were provided as input, this task did not require detection of named entity spans or types. The relation types are described in Sect. 3.2. The input was the unlabeled EHR notes and a bioC file containing the list of named entities present. The output was a bioC file containing the relationships, if any, between the provided named entities. The evaluation for this task used the F1 score metric described in Sect. 3.5.

Task 3: Joint Relation Extraction (NER-RI)

This task required prediction of both the named entities and their relations; therefore, submitted systems had to perform NER and RI jointly. The input was an unlabeled EHR document. Submissions were expected to correctly extract the named entities, predict their types, and predict their relations. The output was a bioC file containing the named entities and relations. The evaluation for this task used the exact phrase-match F1 score described in Sect. 3.5.

4.1 Submissions

The workshop submissions included 41 runs from 11 teams. Participation was largest in the first task (NER) and smallest in the third (NER-RI). We show the exact F1 score, precision, and recall for all teams in Table 4 (NER task), Table 5 (RI task), and Table 6 (NER-RI task). Table 7 provides a tabular view of the features and methods used. The best F1 scores were 0.829 for the NER task (task 1), 0.8684 for the RI task, and 0.6170 for the NER-RI task. We also evaluated the extended runs, although they were not used for the MADE task rankings; no extended run performed better than the top system in any of the three tasks. The label-wise recall, precision, and F1 scores of the best submission in each of the three tasks are provided in Tables 8, 9, and 10. These tables also provide the scores of an ensemble prediction system composed of the top three runs in each task. More details about the ensemble system are provided in Sect. 4.3.

Table 4 Performance metrics for the best runs by teams for the named entity recognition task (shared task 1)
Table 5 Performance metrics for the best runs by teams for the relation identification task (shared task 2)
Table 6 Performance metrics for the best runs by teams for the joint relation identification task (shared task 3)
Table 7 Architecture details shared by teams within the workshop proceedings
Table 8 Label-wise recall, precision, and F1 score values for the top submission and ensemble in task 1
Table 9 Label-wise recall, precision, and F1 score values for the top submission and ensemble in task 2
Table 10 Label-wise recall, precision, and F1 score values for the top submission and ensemble in task 3

A brief overview of the methods used is provided in the following section. Please refer to the relevant team papers for further details about their methodologies.

4.2 Methods

As shown in Table 7, although the teams developed a variety of sequence labeling and machine learning models, long short-term memory (LSTM) networks and CRFs were the most widely used models for NER. The RI task saw a wider variety of models, such as support vector machines (SVMs) and random forests. A brief overview of the methods used for NER and relation classification is provided in the following.

NER

The task of NER can be posed as a sequence labeling problem, in which a sentence is treated as a sequence of tokens. The task is then reduced to labeling each token with a named entity tag or an “outside” (no named entity) tag. Commonly used algorithms for sequence labeling are Markov models (hidden Markov models, CRFs), neural network models (convolutional neural networks, recurrent neural networks [RNNs]), or a combination of both.
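For illustration, the example sentence from Sect. 3.1.1 tagged with a BIO-style scheme might look as follows; the exact tagging scheme (BIO, BIOES, etc.) varied across teams.

```python
# Token-level view of NER as sequence labeling with BIO tags (illustrative).
tokens = ["Patient", "had", "anaphylaxis", "after", "getting", "penicillin", "."]
tags   = ["O",       "O",   "B-ADE",       "O",     "O",       "B-Drugname", "O"]
```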

Linear-chain CRFs [66] and related models (maximum-entropy Markov models [MEMMs] [67], hidden Markov models [HMMs] [68]) belong to a class of machine learning methods based on Markov models. The CRF, a widely used method for sequence labeling, maximizes the probability of the entire label sequence conditioned on the input sentence.
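A minimal linear-chain CRF setup is sketched below using the sklearn-crfsuite package; the feature template is deliberately simple and does not reproduce the feature set of any particular submission.

```python
# Sketch of a linear-chain CRF tagger; X_train (lists of per-token feature
# dicts, e.g., built with token_features) and y_train (BIO tag sequences)
# are assumed to be prepared elsewhere.
import sklearn_crfsuite

def token_features(tokens, i):
    return {
        "word.lower": tokens[i].lower(),
        "word.istitle": tokens[i].istitle(),
        "suffix3": tokens[i][-3:],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
    }

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
# crf.fit(X_train, y_train)
```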

RNNs such as LSTM networks [69] or gated recurrent units (GRUs) [70] are neural networks with recurrent connections designed to process sequential data. They have been shown to be useful in several NLP tasks, such as NER [71], language modeling [72], and part-of-speech tagging [73]. CRFs and RNNs were the two main methods used for the NER task in the MADE challenge. These methods take bag-of-words and other relevant features as inputs and produce label sequences as outputs. Several teams experimented with character embeddings and other sub-word representations such as suffix and prefix embeddings. Features such as part of speech and surface features (e.g., case-based features) were also used with both RNN- and CRF-based models. All teams used pre-trained word embeddings, either to initialize their neural models or as features for CRF training.
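The following is a minimal BiLSTM tagger sketch in PyTorch. The submitted systems used richer architectures (e.g., BiLSTM-CRF with character embeddings and pre-trained embeddings), so this should be read as a simplified illustration rather than any team’s model.

```python
# Minimal bidirectional LSTM tagger: embeds token ids, runs a BiLSTM, and
# produces per-token tag scores; training would minimize token-level cross
# entropy against the reference BIO tags.
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.out(hidden)            # (batch, seq_len, num_tags)
```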

Relation Classification

The RI and NER-RI tasks require classification of named entity pairs into several relation classes. The absence of a relation between a named entity pair can be treated as an additional class in a multi-class classification scheme. Some submissions divided the classification into two sequential steps: first predicting whether a relationship exists between the two named entities, and then predicting the type of that relation.

The classification methods ranged from neural network-based methods, such as a bidirectional LSTM with an attention layer [59], to random forests [64, 65] and SVMs [61]. Neural network-based methods use a final softmax layer and a cross-entropy loss to train the relation classifier. The SVM [74] is a statistical machine learning technique that uses a maximum-margin loss to train the classifier. Random forests [75] are a class of ensemble methods that combine the scores of a collection of decision trees to produce the class prediction.
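A hedged sketch of a feature-based relation classifier is shown below. The feature set and the candidate-pair representation are illustrative assumptions and do not reproduce any specific submission; the same pipeline could also be split into the two-step scheme described above (a binary detector followed by a type classifier).

```python
# Sketch of a multi-class relation classifier over candidate entity pairs;
# "none" is used as the label for pairs with no relation. candidate_pairs and
# labels are assumed to be generated from the annotated corpus elsewhere.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

def relation_features(pair):
    return {
        "e1_type": pair["e1_type"],                  # e.g., "Drugname"
        "e2_type": pair["e2_type"],                  # e.g., "ADE"
        "token_distance": pair["token_distance"],    # tokens between entities
        "same_sentence": int(pair["same_sentence"]),
    }

clf = make_pipeline(DictVectorizer(sparse=False),
                    RandomForestClassifier(n_estimators=200))
# clf.fit([relation_features(p) for p in candidate_pairs], labels)
```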

4.3 Analysis

The micro-averaged F1 score for task 3 was significantly lower than those for tasks 1 and 2. This was expected, since predictions in task 3 compound the errors of both the NER and the RI steps. For the real-world application of extracting drugnames and related ADEs, the F1 score needs to be further improved from its current best of 0.4272 in the NER-RI task (Table 10). A major factor behind the low score for the ADE–drugname relation type is the low NER F1 score for ADE entities. However, predicting this relation is itself a challenging problem, as evidenced by the F1 score of 0.72 in task 2. This may be because the text span between the two entities in this relation can be large (Table 3). Similar arguments apply to the indication–drugname relation, another relation type that is important for downstream applications such as drug-efficacy studies.

We ran paired-sample t tests to evaluate the statistical significance of the differences between the top three models in each task. The paired t test evaluates the difference between two related variables, and differences were considered significant if the p value was < 0.05. The samples used in our tests were file-level micro-averaged F1 scores. For the first task, we found no statistically significant difference between the first [58] and second [59] systems. However, the third system [60] was statistically significantly different from both Wunnava et al. [58] and Dandala et al. [59]. For task 2, all differences between the top three teams were statistically significant. For task 3, the third system [62] was statistically significantly different from the first [59]; all other differences among the top three teams in this task were not statistically significant.
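The test itself is standard; the sketch below runs a paired t test on per-file micro-averaged F1 scores of two systems using SciPy, with illustrative numbers rather than the actual challenge scores.

```python
# Paired t test on per-file F1 scores of two systems (illustrative values).
from scipy.stats import ttest_rel

f1_system_a = [0.81, 0.84, 0.79, 0.83, 0.80]
f1_system_b = [0.80, 0.82, 0.78, 0.85, 0.79]

t_stat, p_value = ttest_rel(f1_system_a, f1_system_b)
print(p_value < 0.05)      # True if the difference is significant at 0.05
```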

We built an ensemble system using the submitted runs. The ensemble output for each task was generated using a simple majority-vote scheme: a prediction (a named entity or a relation instance) is kept by the ensemble if a majority of submissions agree on it. For shared task 1, the entire named entity phrase along with its type is taken as one prediction instance. For tasks 2 and 3, the complete relation prediction (the relation type along with its constituent named entity predictions) constitutes one instance. The ensemble F1 scores are shown in Table 11. One ensemble was created by choosing the best standard run from each team; we also built an ensemble composed of one standard run from each of the top three teams for each task. The ensembles show significant performance gains in tasks 2 and 3 compared with the best individual system in each shared task, and even in task 1 the increase in F1 score is around 0.02. This indicates that the top systems in the MADE challenge do not all learn the same patterns from the dataset; instead, there is variability in the information they learn.

Table 11 Performance metrics calculated for ensemble of submissions as described in Sect. 4.3
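A minimal sketch of the majority-vote scheme described above is given below, with each prediction represented as a hashable tuple (e.g., a named entity span plus its type, or a full relation instance); this is not the exact aggregation code used to produce Table 11.

```python
# Keep a prediction if strictly more than half of the systems produced it.
from collections import Counter

def majority_vote(system_predictions):
    """system_predictions: list of sets, one set of predictions per system."""
    counts = Counter()
    for preds in system_predictions:
        counts.update(preds)
    threshold = len(system_predictions) / 2
    return {pred for pred, count in counts.items() if count > threshold}
```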

The NER and NER-RI tasks are interesting not only from a research perspective but also because they can serve as steps in practical information extraction pipelines. It is non-trivial to estimate the F1 score threshold required for good real-world performance, so we cannot directly calibrate an F1 score of 0.8 against real-world utility. However, in our experience, a precision of 0.83 suggests that a system can extract reasonably accurate and useful data from unstructured text. We therefore believe that these models should be good enough for large-scale statistical studies in which count-based thresholds can be used to reduce the noise in the extracted data. However, applications that require patient-specific information may need NER systems with higher recall and precision. For instance, systems that use statistical methods to predict outcomes at the patient level may be very sensitive to the noise introduced by MADE NER systems.

As mentioned previously, the NER-RI performance of the submitted systems was markedly lower than their NER performance. The precision of the NER-RI systems was significantly improved by building an ensemble, as shown in Table 11. However, a relatively low F1 score of around 0.6 suggests that current NER-RI systems need further improvement to be useful in real-world applications. Future work on improving these models can focus on (1) improving the machine learning models, (2) annotation efforts to build larger labeled corpora, or (3) designing machine learning techniques that use external knowledge and unlabeled text.

5 Corpus Errors

The annotations in the corpus contain a few inconsistencies and errors. Some of these errors were observed while testing the evaluation script on the data, and several were reported by the participating teams. The errors fall into two categories: inconsistency in annotations and overlapping annotations.

The first category involves inconsistent annotation of trailing periods in named entities. For example, for the phrase "q.i.d.", annotations sometimes omit the final period, marking only "q.i.d". This inconsistency occurs only with the period character and only when it is a trailing period. To account for it, the evaluation script ignores trailing span errors of one character in length for all tasks.

Errors due to overlapping annotations occur when the spans of two named entities overlap. The two overlapping entities can be of the same or different types. A common error in this category is overlapping annotations of the same type; for example, both the phrase "vitamin D" and the overlapping sub-phrase "vitamin" are annotated as separate drugname entities. Another common error is double annotation of the same named entity; for example, the same span "nausea" is annotated twice, once as an ADE and once as other SSD. Since it is not trivial to disambiguate the correct annotation in these cases, the evaluation script treats all reference annotations as correct, which means it slightly underestimates the true scores. However, since these errors affect only around 130 named entity annotations (out of more than 70,000 in MADE), the evaluation script still accurately assesses the performance of the submitted systems.

6 Conclusion

We created an expert-curated corpus comprising longitudinal EHR notes from patients with cancer. The MADE corpus was annotated with medication- and ADE-related information. We released this corpus to the research community and used it as a benchmark to evaluate state-of-the-art NLP models. The MADE results show that recent progress in NLP has led to remarkable improvements in NER and RI for the clinical domain. However, as demonstrated by the joint NER-RI task, room for improvement remains. We invite future research efforts to improve the state of the art on these benchmarks.