
Learning Objectives

  1. Define the term natural language processing (NLP), and describe its relevance to clinical research.

  2. List and describe four different approaches to developing NLP.

  3. Describe the importance of a gold standard clinical corpus, and describe the five steps for developing a gold standard clinical corpus.

  4. Discuss the benefits of using NLP to facilitate multisite clinical research and national research registries and describe challenges and strategies for deploying existing NLP solutions into different EHR environments.

The Role of Clinical Natural Language Processing in the Secondary Use of EHR

The Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 provides incentives for the rapid adoption and implementation of electronic health record (EHR) systems across the nation [1]. As a result, the availability of longitudinal and dense EHR data offers an unprecedented opportunity to conduct cost-effective clinical research (patient-oriented research, epidemiological and behavioral studies, or outcomes and health services research) [2]. Since then, there has been a rapid increase of studies reported using EHR data with applications including investigation of patient outcomes [3], disease comorbidities [4], risk stratifications [2], and drug interactions [5].

An EHR is a computerized health record for documenting patient information at care encounters [6]. EHRs can be represented through a variety of different formats such as (1) structured (e.g., demographic information, procedures), (2) semi-structured (e.g., patient provided information), (3) unstructured (e.g., clinical notes, radiology reports, pathology reports, operative reports), and (4) binary files (e.g., medical imaging files). A well-known challenge in EHR-based clinical research is that much of the detailed patient information is embedded within clinical narratives and represented in semi-structured or unstructured formats. A traditional method of screening or extracting information from EHRs for clinical research is manual chart review, a process of reviewing or abstracting information and assembling patient cohorts or data sets for research investigation [7]. As a significant amount of clinical information is represented in textual format, execution of such a human-assisted approach is time-consuming, labor-intensive, and non-standardized [7,8,9,10]. Natural language processing (NLP), a subfield of computer science and linguistics, has been leveraged to computationally process and analyze EHRs for clinical research [11, 12]. In the following sections, we exhibit two common use case examples of how clinical NLP techniques are leveraged to support research.

Use Case 1: Information Retrieval for Eligibility Screening or Cohort Identification

Information retrieval (IR) is the process of computationally ranking and acquiring information resources (e.g., patient phenotypic profiles and clinical documents) from a collection of resources (e.g., patient lists and clinical documents) based on relevant information needs (i.e., queries); NLP techniques can be adopted in this process [11]. A common IR application in clinical research is eligibility screening (i.e., cohort identification or patient phenotype retrieval [13]), a process of determining a participant’s eligibility for enrolling in a study based on pre-defined inclusion and exclusion criteria [14, 15]. In recent years, an increasing number of academic institutions and medical centers have applied IR technology to their internal EHR data to electronically screen eligible patients for clinical studies. Advanced Text Explorer (ATE) is an example of such an IR system developed by Mayo Clinic. The system leverages Elasticsearch, a distributed full-text search engine built on Apache Lucene, to handle large-scale real-time document retrieval tasks [16]. EMERSE is a similar IR system developed by the University of Michigan that leverages Apache Solr, also an Apache Lucene-based search engine, for document indexing [17].

Such an IR system allows users to enter customized queries, based on the pre-defined eligibility criteria, to search clinical documents and select or remove prospective study candidates. In the example presented in Fig. 21.1, a study designed to investigate the effect of night shift work on cognitive functioning needs to identify participants with a history of working night shifts. Queries can then be constructed to search EHRs and identify prospective candidates. Based on the search results, users can decide whether to keep refining the search query or to conduct a chart review for case validation.

Fig. 21.1
A block diagram illustrates the eligibility criteria, query, and result where participants with a history of working night shifts are identified to investigate the effect of night shift work on cognitive functioning.

An example of using information retrieval (IR) for cohort identification
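
The query-and-rank idea behind this use case can be sketched with a toy in-memory inverted index. This is only an illustration under simplifying assumptions: production systems such as ATE and EMERSE rely on Lucene-based search engines, and the note texts, identifiers, and scoring rule (summed term frequency) below are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Build an inverted index mapping each token to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Rank documents by the summed frequency of the query terms, highest first."""
    scores = defaultdict(int)
    for token in tokenize(query):
        for doc_id, freq in index.get(token, {}).items():
            scores[doc_id] += freq
    # Ties are broken alphabetically by document id to keep the order deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical clinical notes used only for illustration.
notes = {
    "note_1": "Patient works night shift as a nurse. Reports poor sleep.",
    "note_2": "Retired teacher. No history of shift work.",
    "note_3": "Worked the night shift for ten years; night work stopped in 2015.",
}
index = build_index(notes)
ranked = search(index, "night shift")
```

Note that `note_2` is still returned, although it describes the absence of shift work. Keyword retrieval yields candidates only, which is why the workflow ends with query refinement or chart review.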

Use Case 2: Information Extraction for Assembling Clinical Research Data Sets

Information extraction (IE) is a sub-task of NLP aiming to automatically extract pre-defined clinical concepts from unstructured text through concept mention detection (i.e., named entity recognition [NER]) and concept normalization (i.e., map the mentions to concepts in standard or pre-defined terminologies) [9, 18,19,20,21,22,23]. IE can be utilized to assist clinical research by computationally extracting information from clinical documents and assembling a research data set for various research purposes. Common research tasks for clinical IE include case ascertainment [23, 24] and data abstraction [7, 25,26,27,28].

A typical clinical IE task is illustrated in Fig. 21.2. Because delirium is underdiagnosed in clinical practice and is not routinely coded for billing, NLP can serve a distinct role in facilitating case ascertainment. In this particular use case, the goal is to extract cognitive and neuropsychological data elements based on a standard definition to identify patients with delirium from unstructured EHR text [29]. Based on the defined research objectives, a standard definition, here the confusion assessment method (CAM), is established either by adopting existing clinical criteria or by having domain experts develop new definitions. Corresponding NLP algorithms are created based on these definitions and applied to relevant data sources such as clinical notes. We can then infer a positive delirium status from the positive status of the extracted concepts “mental status decline,” “fluctuating attention,” and “disorganized thinking.” The generated results can then be used in downstream analytics to help answer specific clinical questions (e.g., how is delirium associated with outcomes in hospitalized COVID-19 patients?).

Fig. 21.2
A block diagram illustrates clinical problems, research questions, data challenges, gold standard definition, information extraction, and patient status.

An example of information extraction being used for delirium research

Foundations of Clinical Natural Language Processing

The steps involved in the development of a gold standard clinical corpus can be divided into five key components: (1) task formulation, (2) corpus annotation (e.g., annotation guideline development, training, and production), (3) model development, (4) model evaluation, and (5) model application (Fig. 21.3) [30,31,32,33]. In the ensuing subsections, we will delve into each of these components in further detail.

Fig. 21.3
A block diagram illustrates the task formulation, corpus annotation, model development, model evaluation, and model application of NLP development.

An overview of NLP development and evaluation for clinical research

Task Formulation

Formulation of a clinical NLP task involves defining targets of interest to extract, conducting a literature review, consulting domain experts, and identifying study stakeholders such as annotators with specialized knowledge [34]. Cohort screening is the process of identifying study participants based on eligibility criteria. The initial step is to establish a screening protocol highlighting detailed inclusion and exclusion definitions. These definitions will then be operationalized using EHR data such as patient demographics, procedure codes, diagnosis codes, and problem lists to assemble study cohorts. Based on the established cohort, corresponding clinical documents (e.g., clinical notes) are retrospectively retrieved leveraging APIs (Application Programming Interface) or SQL (Structured Query Language) to query against enterprise data warehouses.
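
As a minimal sketch of the retrieval step, the following example assembles a cohort from diagnosis codes and pulls the corresponding notes with SQL. The table names, column names, codes, and note texts are hypothetical, and an in-memory SQLite database stands in for an enterprise data warehouse, whose schema and access method will differ by site.

```python
import sqlite3

# Hypothetical, simplified schema; real enterprise data warehouses differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diagnosis (patient_id TEXT, icd_code TEXT);
CREATE TABLE clinical_note (note_id TEXT, patient_id TEXT, note_text TEXT);
INSERT INTO diagnosis VALUES ('p1', 'E885.9'), ('p2', 'I10'), ('p3', 'W01.0XXA');
INSERT INTO clinical_note VALUES
  ('n1', 'p1', 'Patient slipped on ice.'),
  ('n2', 'p2', 'Blood pressure follow-up.'),
  ('n3', 'p3', 'Fell at home last week.');
""")

def notes_for_cohort(connection, code_prefixes):
    """Return (patient_id, note_id, note_text) rows for patients who have
    at least one diagnosis code starting with any of the given prefixes."""
    if not code_prefixes:
        raise ValueError("at least one diagnosis code prefix is required")
    # Only placeholders are interpolated into the SQL; values are bound as parameters.
    where = " OR ".join("d.icd_code LIKE ?" for _ in code_prefixes)
    sql = (
        "SELECT DISTINCT n.patient_id, n.note_id, n.note_text "
        "FROM clinical_note AS n JOIN diagnosis AS d ON d.patient_id = n.patient_id "
        f"WHERE {where} ORDER BY n.note_id"
    )
    params = [prefix + "%" for prefix in code_prefixes]
    return connection.execute(sql, params).fetchall()

rows = notes_for_cohort(conn, ["E88", "W0"])
```

Binding the code prefixes as query parameters, rather than formatting them into the SQL string, keeps the screening definition separate from the query text and avoids injection problems when criteria are supplied by users.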

Corpus Annotation

Corpus annotation is the practice of marking pre-defined clinical or linguistic information in a given document [35]. In general, there are three phases in the annotation process: (1) training and onboarding, (2) guideline development, and (3) annotation production and adjudication. The first phase starts by assembling an annotation team and identifying key stakeholders such as annotators and adjudicators. This step is followed by a preliminary meeting to discuss the overall goal of the study and to walk through the generic annotation process. Training sessions can be hosted to allow annotators to become familiar with the annotation tool and the definitions of interest. The guideline development phase produces a detailed annotation guideline specifying the common standards and definitions for the given task. Developing guidelines is typically iterative and commonly involves the following activities: prototyping a baseline guideline, performing annotation, calculating inter-annotator agreement (IAA), organizing consensus meetings, and updating the guideline. IAA is often calculated through Cohen’s kappa coefficient [36] or F1-score [37]. The process repeats until a satisfactory level of agreement is reached (e.g., a kappa greater than or equal to 0.9). Annotation production and adjudication can be organized into a batch-based process for quality control. The production process is similar to guideline development except that more documents are annotated per batch. Adjudication is the process of resolving inconsistencies between different annotators. There are several ways to perform adjudication. The most common method is to have a third, independent domain expert directly overwrite the result or to apply a majority vote. Team- or panel-based adjudication can be applied to resolve challenging cases. When an independent adjudicator is not available, the two original annotators may reach the final consensus through extensive discussion.
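
The IAA calculation mentioned above can be sketched for two annotators using Cohen’s kappa, defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The labels below are hypothetical, and the function is a plain-Python illustration rather than the implementation used by any particular annotation tool.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Compute Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label proportions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical sentence-level labels from two annotators.
annotator_1 = ["fall", "fall", "no_fall", "no_fall", "fall",
               "no_fall", "no_fall", "no_fall", "fall", "no_fall"]
annotator_2 = ["fall", "fall", "no_fall", "no_fall", "no_fall",
               "no_fall", "no_fall", "no_fall", "fall", "no_fall"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this example the annotators agree on 9 of 10 items (p_o = 0.90) while chance agreement is p_e = 0.54, giving a kappa of roughly 0.78. Under the example threshold of 0.9, this batch would trigger another consensus meeting and guideline update.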

Model Development

Due to the high prevalence and usage of information extraction applications in clinical research, we will primarily focus on IE-related methodologies in this section. Methods for developing IE applications can typically be stratified into symbolic, traditional machine learning (non-deep learning variants), deep learning, or hybrid approaches. The Linguistic String Project-Medical Language Processing (LSP-MLP) project was an early effort aiming to develop clinical IE applications to extract medical concepts from clinical narratives leveraging semantic lexicons (terms) and rules [38, 39]. Since 1990, there has been an increasing number of statistical NLP studies published [12]. Recent advances in computational technologies such as graphics processing units (GPUs) have influenced the adoption of deep learning approaches for clinical IE [40,41,42]. By combining symbolic and machine learning approaches, hybrid approaches have also gained substantial popularity because they draw on the strengths of both. The following sections provide a methodological overview of each approach.

Symbolic Approach

Symbolic or rule-based approaches use a comprehensive set of lexicons and rules to identify pre-defined patterns in text [43, 44]. This approach has been adopted in many clinical applications due to interpretability and customizability, i.e., the effectiveness of implementing domain-specific knowledge [9] and/or controlled vocabularies [45]. For example, one advantage of the symbolic approach is the ability to leverage existing resources such as clinical criteria, guidelines, medical dictionaries, and knowledge bases. The strategy is to incorporate well-curated clinical knowledge resources such as Unified Medical Language System (UMLS) Metathesaurus [46], Medical Subject Headings (MeSH) [47], and MEDLINE® to facilitate the curation and normalization of lexicons [48]. Based on specific tasks, the combination of rules and well-curated dictionaries can result in promising performance. In addition, to strengthen the ability for capturing important contextual patterns such as family history, negated, possible, and hypothetical sentences, context algorithms are commonly utilized. As an example, NegEx, developed by Chapman et al., is one of the most popular context algorithms used in clinical NLP [49].

The development of lexicons and rules is a manual and iterative process that can be summarized into the following steps: (1) adopting an existing symbolic NLP framework (see section “An Overview of Clinical NLP Systems and Toolkits”), (2) assessing existing knowledge resources, (3) crafting lexicons and rules based on clinical criteria and/or expert opinions, and (4) evaluating and refining lexicons and rules. The refinement of customized lexicons and rules is an iterative process involving multiple subject matter experts. At each iteration, the rules are applied to a reference standard corpus, and the results are evaluated. Based on the evaluation performance, domain experts review the misclassified mentions or sentences and determine the reasons for misclassification. This process is repeated until the rules reach a reasonable performance (e.g., F1-score ≥ 0.95).
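
To make the combination of a lexicon and a context rule concrete, the sketch below pairs a small keyword lexicon with a simplified negation check. It is loosely inspired by the trigger-and-window idea of context algorithms but is not an implementation of NegEx; the lexicon, the trigger list, and the five-token window are illustrative assumptions.

```python
import re

# Hypothetical mini-lexicon and negation triggers, for illustration only.
FALL_LEXICON = re.compile(r"\b(falls?|fell|fallen|slipped|tripped)\b", re.IGNORECASE)
NEGATION_TRIGGERS = re.compile(r"\b(no|denies|denied|without|negative for)\b", re.IGNORECASE)
WINDOW = 5  # assumed scope of a negation trigger, in tokens before the mention

def extract_mentions(sentence):
    """Return (mention_text, is_negated) pairs for one sentence."""
    results = []
    for match in FALL_LEXICON.finditer(sentence):
        # Look only at the few tokens immediately preceding the mention.
        preceding_tokens = sentence[:match.start()].split()[-WINDOW:]
        context = " ".join(preceding_tokens)
        negated = NEGATION_TRIGGERS.search(context) is not None
        results.append((match.group(0), negated))
    return results

negated_example = extract_mentions("Patient denies any recent falls.")
affirmed_example = extract_mentions("She fell from a ladder yesterday.")
mixed_example = extract_mentions("No falls since last visit, but tripped twice in 2019.")
```

During refinement, experts would inspect misclassified sentences and adjust the lexicon, triggers, or window accordingly. For instance, this sketch only looks backward, so a trailing negation such as "falls: none" would be missed.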

Traditional Machine Learning

“Traditional” machine learning (i.e., non-deep learning variants) can automatically learn patterns without explicit programming [50,51,52,53]. In contrast to deep learning methods, traditional machine learning approaches require more human intervention in the form of feature engineering, a process of selecting and converting raw text into features that can be used in machine learning models. Although feature engineering can be complex, the ability to process and learn from large document corpora greatly reduces the need to manually develop lexicons and rules.

The process of developing traditional machine learning models can be summarized into the following steps: task formulation, data pre-processing, word representation (feature engineering), model training, optimization, and evaluation. In clinical IE, there are two common tasks to be formulated: (1) classification: assigning pre-defined labels to documents or sentences; and (2) structured prediction: sequence labeling and segmentation to recognize entities or other semantic units. Commonly reported clinical IE tasks include boundary detection-based classification and sequential labeling. Boundary detection aims to detect the boundaries of the target type of information. For example, the BIO tagging scheme uses B for the beginning, I for the inside, and O for the outside of a concept. Sequential labeling-based extraction methods transform each sentence into a sequence of tokens, each with a corresponding property or label. One advantage of sequential labeling is that it takes the dependencies among the target information into account. Common pre-processing steps include segmenting documents into sentences, dividing text into individual words (tokenization), and reducing a word to its word stem (stemming). Existing word representation methods for classification tasks include bag-of-words [54,55,56,57,58,59], continuous bag-of-words (CBOW) [60, 61], and word embedding [62,63,64,65] models. Traditional bag-of-words models convert words into a high-dimensional one-hot space, which potentially introduces sparsity, increases the size of data, and removes any sense of semantic similarity between words. Word embeddings can enhance the word semantic encoding by capturing latent syntactic and semantic similarities [66].
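
The BIO scheme described above can be illustrated by converting annotated token spans into a tag sequence, the form in which training data is typically presented to a sequence labeling model. The tokens, the span, and the `FALL` label are hypothetical.

```python
def bio_tags(tokens, spans):
    """Convert annotated spans into BIO tags.

    spans: list of (start_token, end_token_exclusive, label) tuples.
    """
    tags = ["O"] * len(tokens)  # every token starts outside any concept
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the concept
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens inside the concept
    return tags

# Hypothetical tokenized sentence with one annotated concept ("fall from ladder").
tokens = ["Patient", "had", "a", "fall", "from", "ladder", "yesterday", "."]
spans = [(3, 6, "FALL")]
tags = bio_tags(tokens, spans)
```

Each token now carries exactly one label, so a model such as a CRF can learn transitions between tags (for example, that `I-FALL` should follow `B-FALL` or `I-FALL`), which is how sequential labeling captures dependencies among neighboring tokens.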

Frequently used traditional machine learning models for clinical IE include decision tree (DT) [67], logistic regression (LR) [68], Bayesian network [69], k nearest neighbor (k-NN) [70], random forests [71], hidden Markov model (HMM) [72], support vector machine (SVM) [73], structural support vector machines (SSVMs) [74], and conditional random fields (CRF) [75]. Among the aforementioned models, CRFs and the SVM are the two most popular models for clinical IE [76]. CRFs can be thought of as a generalization of LR for sequential data. SVMs use various kernels to transform data into a more easily discriminative hyperspace. In addition, structural support vector machines is an algorithm that combines the advantages of both CRFs and SVMs [76].

Deep Learning

Deep learning, a subfield of machine learning that focuses on learning patterns from dense representations of a large amount of data, has become an emerging trend in clinical NLP research [42, 77, 78]. In contrast to traditional machine learning approaches, deep learning approaches reduce the need to explicitly engineer data representations. In clinical NLP, the deep learning algorithms are focused on neural networks or their variants such as convolutional neural networks (CNN) [79,80,81,82], recurrent neural networks (RNN) [83,84,85], gated recurrent unit (GRU) [86], long short-term memory (LSTM) networks [87], and transformers [88].

CNN is a type of artificial neural network (ANN) that relies on convolutional filters to capture spatial relationships in the inputs and pooling layers to minimize computational complexity. Although these models have been found to be exceptionally effective for computer vision tasks, CNNs may have a difficult time capturing long-distance relationships in text [89]. RNNs are neural networks that explicitly model connections along a sequence, making them well suited for tasks that require sequential dependencies to be captured [90, 91]. Conventional RNNs are, however, limited in modeling capability by the length of text due to problems with vanishing gradients. Variants such as LSTM [87] and GRU [86] have been developed to address this issue by separating the propagation of the gradient and control of the propagation through “gates.” Meanwhile, many researchers have combined deep learning architectures with the CRF framework to further improve model performance. This takes advantage of their relative strengths: the long-distance modeling of RNNs and the CRF’s ability to jointly connect output tags. Well-known architectures include CNN-CRF, Bi-LSTM-CRF, and Bi-LSTM-Attention-CRF. More recently, transformer architectures have been proposed to further improve the ability to capture complex dependencies and context. The architecture enables the segmentation of sentences, and adding subsequent layers is therefore needed to allow the model to accommodate long sequences of text without crippling memory constraints [88]. Thus, transformers can effectively model relationships with long word distance and are much more computationally efficient compared to RNN variants. Pre-trained representations based on this architecture such as BERT [40] and GPT [92] have yielded significant improvements in state-of-the-art performance in many NLP tasks [93].


Hybrid Approach

Leveraging the advantages of both rule- and machine learning-based approaches, hybrid approaches combine them into one system, potentially offering a more comprehensive solution. There are two major hybrid architectures. The first architecture uses a symbolic system to extract features. These features are then used as input for the machine learning system. This architecture has the potential to achieve improved performance compared with purely symbolic or machine learning-based approaches due to the informative features supplied by the symbolic system. As an example, Szarvas et al. applied pattern-based trigger words to improve their NER model for clinical de-identification tasks [94]. The second architecture uses machine learning approaches (or symbolic approaches) to rectify incorrect cases from symbolic approaches (or machine learning approaches). This architecture is also referred to as a “supplemental hybrid approach” or “post-hoc design” [23, 95] and has been leveraged to develop a generic IE framework [96] or to extract specific concept mentions [95, 97].

Model Evaluation

Rigorous model evaluation is crucial for developing valid and reliable clinical IE applications. Evaluation starts by defining the granularity of the subjects to be assessed. Common levels of granularity include concept (or mention), sentence, document, and patient. The level at which evaluation is performed is typically determined by the specific task or application. Most studies reported using a combination of concept- and document-level evaluations [23]. Once the level is defined, the evaluation can be performed by constructing a confusion matrix (contingency table) that tabulates the counts of true positives, false positives, false negatives, and true negatives. From these counts, common evaluation metrics can be derived, including sensitivity or recall, specificity, precision or positive predictive value (PPV), negative predictive value (NPV), and F1-score or F-measure. The F1-score, the harmonic mean of sensitivity and precision, is a well-established metric in the information retrieval community [37]. In addition, the area under the ROC curve (AUC) and the area under the precision-recall curve (PRAUC) are commonly used for evaluating machine learning models. Evaluation designs include the hold-out method, where the model is trained on a training set and evaluated on a blinded test set, and (nested) cross-validation (CV), where the prediction error of a model is estimated by iteratively training on part of the data and leaving the rest for testing [98, 99].
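
The metrics listed above follow directly from the four cells of a confusion matrix. The sketch below uses the standard definitions; the counts passed in at the end are hypothetical and do not come from any study discussed in this chapter.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Derive common evaluation metrics from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Return NaN instead of raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),       # recall
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),               # precision
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),   # harmonic mean of PPV and sensitivity
    }

# Hypothetical counts for a sentence-level evaluation.
metrics = evaluation_metrics(tp=80, fp=20, fn=10, tn=890)
```

With these counts, specificity (about 0.98) and NPV (about 0.99) are high largely because negatives dominate the data, while the F1-score (about 0.84) is more sensitive to the false positives and false negatives. This is one reason F1 is often preferred when the target concept is rare.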

Model Application

After the evaluation process is finished, the model can be deployed and applied to assemble clinical cohorts or assist in data abstraction in the context of the problem that the model is designed for. The process can be achieved by treating the model as a standalone tool. Corresponding clinical data can be assembled by following the steps highlighted in the section “Task Formulation”. A more integrated solution is to deploy the model into the existing data infrastructure or EHR environment. However, the implementation process varies and can be dependent on the maturity of each site’s specific infrastructure and policy [100].

A Step-by-Step Case Demonstration

In this section, we present a step-by-step case demonstration for developing two different NLP approaches (symbolic and deep learning) under a case study of aging. Falls are a leading cause of unintentional injury. However, studies have found that the use of billing codes may underestimate true fall events [101]. The case study aims to fully leverage the EHR data and NLP to accurately identify fall events from clinical notes. We supplied additional supporting materials to assist the case demonstration (

Task Formulation

The task was defined as developing two NLP models (a symbolic approach and a pre-trained language model approach) to extract fall-related mentions and sentences from clinical notes at Mayo Clinic Rochester. A literature review was conducted to identify existing methods and dictionaries for adoption [101,102,103,104,105,107]. The domain experts included in the project were two geriatricians and one palliative care physician. A screening protocol was co-developed by the study team using diagnosis codes. The protocol defines the study participants as Mayo Clinic Biobank patients aged 65 or older at the time of enrollment. Cases were identified using fall-related ICD-9 and ICD-10 codes: E804, E833–E835, E843, E880–E888, E917.5–E917.8, E929.3, E987, and W00.0XXA-W18.49XS. Controls were matched on age and sex. A total of 300 patients (150 cases and 150 controls) were assembled through i2b2 (Informatics for Integrating Biology & the Bedside), an open-source clinical data warehousing research platform [108] (Fig. 21.4). Clinical notes were subsequently retrieved for these 300 patients directly from the enterprise data warehouse (EDW) using customized SQL.

Fig. 21.4
A screenshot of integrating biology and the bedside window illustrates the cohort screening interface which includes the terms folder, query tool, and graph results for the number of patients.

Cohort screening interface based on i2b2
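
The code-based case definition in the screening protocol can be expressed as a small membership check. The following is a sketch under stated assumptions: three-digit ICD-9 categories (e.g., E880) are taken to include their subcodes, and the ICD-10 range W00.0XXA to W18.49XS is approximated as categories W00 through W18. Neither interpretation is spelled out in the protocol as described here, and the function name is hypothetical.

```python
import re

# ICD-9 E-code ranges as inclusive numeric bounds. Widening three-digit
# categories to cover subcodes (e.g., E880 -> 880.0 to 880.9) is an assumption.
ICD9_FALL_RANGES = [
    (804.0, 804.9), (833.0, 835.9), (843.0, 843.9), (880.0, 888.9),
    (917.5, 917.8), (929.3, 929.3), (987.0, 987.9),
]
ICD9_PATTERN = re.compile(r"^E(\d{3}(?:\.\d)?)$")
ICD10_PATTERN = re.compile(r"^W(\d{2})(?:\.[0-9A-Z]{1,4})?$")

def is_fall_code(code):
    """Return True if a diagnosis code falls within the fall-related code list."""
    code = code.strip().upper()
    icd9 = ICD9_PATTERN.match(code)
    if icd9:
        value = float(icd9.group(1))
        return any(low <= value <= high for low, high in ICD9_FALL_RANGES)
    icd10 = ICD10_PATTERN.match(code)
    if icd10:
        # Assumption: the ICD-10 range is treated as categories W00 through W18.
        return 0 <= int(icd10.group(1)) <= 18
    return False

# Hypothetical codes for illustration.
screened = {code: is_fall_code(code)
            for code in ["E885.9", "E917.6", "E917.1", "E929.3", "W01.0XXA", "W19.XXXA", "I10"]}
```

Encoding the ranges as data rather than as a long chain of conditions makes the case definition easy to review with clinicians and to update if the protocol changes.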

Corpus Annotation

In this example, the task was formulated as annotating mentions of fall-related expressions in clinical notes. The annotation team consists of two trained nurse abstractors as annotators and one geriatrician as the adjudicator. We choose MedTator as the annotation tool. MedTator is a free and serverless annotation tool released under the Apache Software License [109]. To develop an annotation guideline, we first adopt an existing definition from the ANA National database for nursing quality indicators [11]: “An unplanned descent to the floor (or extension of the floor, e.g., trash can or other equipment) with or without injury.” Fall events that result from either physiological or environmental causes are included. Based on this definition, the annotation task can be specified as highlighting fall-related mentions, indications, and the associated attributes as presented in Table 21.1. Based on the annotation definition, the corresponding annotation schema (.dtd file) is created (Textbox 21.1).

Table 21.1 Fall annotation definition

Textbox 21.1 Example of Fall Annotation Schema in .dtd Format

A text box illustrates the 13-line example of fall annotation schema in .dtd format.

Once the schema is created, annotation can be performed using the MedTator tool. The tool can be accessed through the URL: After the web interface is opened, the first step is to load the annotation schema. This can be achieved by dragging the .dtd file to the top left (first) box. Similarly, raw clinical documents can be dragged into the second box for annotation. If you don’t have a schema or text file yet, you could explore the online sample by clicking the “Sample” button in the top right location.

According to the example presented in Fig. 21.5, “risk of falling” is highlighted as “fall_mention” with certainty as “confirmed,” status as “current,” patient as “experiencer,” and exclusion as “yes.” “fall from ladder” is highlighted as a “fall_mention” with certainty as “confirmed,” status as “past,” patient as “experiencer,” and exclusion as “no.” During the annotation, the task is usually defined to treat each unique concept independently. It is recommended to choose the smallest possible span that semantically encloses the problem, condition, or diagnosis. Additional annotation best practices can be found at

Fig. 21.5
A screenshot of the MedTator interface represents the highlighted risk of falling under the impression, report, plan, and fall from the ladder under the chief complaint.

MedTator interface for fall annotation

Model Development

Symbolic Approach

We use the open-source clinical NLP pipeline MedTagger ( to develop the symbolic model. First, initial keywords and regex search patterns are compiled based on existing studies [12, 101,102,103,104,105,107] and input from domain experts (Textbox 21.2). These patterns are then applied to the training data. False-positive and false-negative cases are manually reviewed for refinement. This process is repeated until an acceptable performance is reached (e.g., F1-score > 0.95).

Textbox 21.2 Example Keywords and Regex Patterns for Fall Identification

a fall; recurrent fall; time of fall; falls?; fell; fallen; collapsed; slipped; tripped; syncope; falling; syncopal (events?|episodes?|spells?); found (\S+\s+){0,3}on the ground; on (\S+\s+){0,3}way down
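
As a rough illustration of how the patterns in Textbox 21.2 behave on text, they can be compiled into a single regular expression in plain Python. This sketch does not use MedTagger or its rule format; the word-boundary anchors, case-insensitive matching, function name, and example sentences are assumptions added here.

```python
import re

# Patterns copied from Textbox 21.2.
FALL_PATTERNS = [
    r"a fall", r"recurrent fall", r"time of fall", r"falls?", r"fell", r"fallen",
    r"collapsed", r"slipped", r"tripped", r"syncope", r"falling",
    r"syncopal (events?|episodes?|spells?)",
    r"found (\S+\s+){0,3}on the ground",
    r"on (\S+\s+){0,3}way down",
]
# Assumption: whole-word, case-insensitive matching.
FALL_REGEX = re.compile(r"\b(?:" + "|".join(FALL_PATTERNS) + r")\b", re.IGNORECASE)

def find_fall_mentions(sentence):
    """Return the text of every fall-related pattern match in a sentence."""
    return [match.group(0) for match in FALL_REGEX.finditer(sentence)]

# Hypothetical sentences for illustration.
hits_ground = find_fall_mentions("Patient was found lying on the ground.")
hits_stairs = find_fall_mentions("She slipped and fell on her way down the stairs.")
hits_none = find_fall_mentions("No acute distress.")
hits_risk = find_fall_mentions("Counseled on risk of falling.")
```

The last example matches "falling" even though it describes a risk rather than an event. Cases like this are what the manual review of false positives is meant to surface, and they correspond to the exclusion attribute used during annotation.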

Deep Learning Approach

We use BERTbase, a model pre-trained on a corpus of unpublished books and Wikipedia, to perform the sequential sentence classification task. The pre-trained BERT model is adopted from the original Google BERT GitHub repository ( The model contains 12 transformer layers with a hidden size of 768 and 12 self-attention heads. For model fine-tuning, the maximum sequence length (e.g., 512) and batch size (e.g., 32) need to be configured. Early stopping is applied to determine the number of training epochs and prevent overfitting. Sample code for both approaches can be found at
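
The early stopping logic can be sketched independently of any deep learning framework. The example below assumes that validation loss is monitored after each epoch and that training stops once the loss has not improved for a fixed number of epochs; the `patience` and `min_delta` parameters and the loss values are hypothetical, since the stopping criterion actually used is not specified here.

```python
def early_stopping_epoch(validation_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch  # new best checkpoint
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # stop and keep the best checkpoint
    return best_epoch, len(validation_losses)

# Hypothetical per-epoch validation losses from a fine-tuning run.
losses = [0.62, 0.48, 0.41, 0.43, 0.45, 0.40]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=2)
```

Here training halts at epoch 5 and the checkpoint from epoch 3 is retained. The later improvement at epoch 6 is never observed, which illustrates the trade-off in choosing a small patience value.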

Model Evaluation

The models are evaluated on an independent test set at the mention or sentence level. The evaluation results presented in Fig. 21.6 indicate that the model achieved 0.895, 0.9912, 0.770, 0.997, and 0.828 in sensitivity, specificity, PPV, NPV, and F1-score, respectively. Error analysis can be performed by manually reviewing incorrect cases. Through the error analysis, we are able to identify false-negative and false-positive samples for future improvement.
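
As a quick consistency check, the reported F1-score can be reproduced from the reported sensitivity and PPV using the standard harmonic-mean definition:

```latex
\[
F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sensitivity}}{\mathrm{PPV} + \mathrm{Sensitivity}}
    = \frac{2 \times 0.770 \times 0.895}{0.770 + 0.895}
    = \frac{1.3783}{1.665} \approx 0.828
\]
```

The gap between the high specificity and the lower PPV suggests that false positives, rather than missed falls, account for most of the errors, which is consistent with directing the error analysis at false-positive samples.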

Fig. 21.6
A confusion matrix represents NLP versus the gold standard for fall, no fall, and total. On the left and right sides of the matrix are false negative and false positive samples.

Example of confusion matrix and error cases

Clinical NLP Resources

An Overview of Clinical NLP Community Challenges

Clinical NLP-related challenges or shared tasks are community activities or competitions with the objective of developing task-specific NLP algorithms within a certain timeline. Solutions are evaluated using standardized criteria across all participating teams. The top-performing teams may be awarded small prizes or be invited to disseminate their methods through conference or journal submissions. A challenge starts by calling for participation and releasing the task details. For example, in the 2019 National NLP Clinical Challenge (n2c2) Family History Extraction challenge, the task was to extract mentions of family members in clinical notes and observations (diseases) in the family history. A common timeline for a challenge includes participant registration (e.g., team formulation, data usage agreement), training data release, test data release, submission deadline, results release, and abstract or manuscript submission. Community challenges have played a vital role in advancing NLP methodologies, disseminating NLP knowledge resources (e.g., annotation guidelines and corpora), engaging informatics researchers, and promoting interdisciplinary collaboration. Furthermore, since the tasks in each challenge are well-defined and standardized by the organizers and are coupled with de-identified, publicly accessible corpora, they are usually regarded as standard benchmarks for state-of-the-art NLP performance evaluation. Well-known clinical NLP tasks include the Semantic Evaluation (SemEval) challenges [109,110,112], BioCreative/OHNLP [112,113,114,116], the Informatics for Integrating Biology and the Bedside (i2b2) challenges [116,117,118,119,121], the National NLP Clinical Challenge (n2c2) [122], and the Conference and Labs of the Evaluation Forum (CLEF) eHealth challenges [110, 111].

An Overview of Clinical NLP Systems and Toolkits

An Overview of Clinical NLP Systems

NLP systems (frameworks) are important resources for the development, standardization, and streamlined execution of symbolic methods. The key advantage of NLP systems is the built-in and modularized text (pre-)processing pipeline such as sentence detector, tokenizer, part-of-speech tagger, chunking annotator, section detector, information extractor, and context annotator [123, 124]. Different NLP systems have been developed at different institutions, including MedLEE [125], MetaMap [126], KnowledgeMap [127], cTAKES [123], HiTEX [128], CLAMP [129], and MedTagger [124]. MedLEE is one of the earliest clinical NLP systems developed and was originally developed for providing clinical decision support for radiographs. The system has been subsequently expanded for processing different clinical documents such as discharge summaries, pathology reports, and radiology reports [125, 130]. MetaMap, developed by the National Library of Medicine (NLM), is a highly configurable system for providing access and mapping from clinical text to the Unified Medical Language System (UMLS) Metathesaurus [126]. cTAKES is one of the most commonly used tools developed using the Unstructured Information Management Architecture framework (UIMA) [131] and OpenNLP natural language processing toolkit under the Apache project. MedTagger is a resource-driven open-source UIMA-based IE framework developed under the Open Health Natural Language Processing (OHNLP) Consortium aiming to create an interoperable, scalable, and usable NLP ecosystem [124]. Meanwhile, major technology companies have all embraced clinical NLP with commercial solutions available on the market (e.g., IBM Watson [132], Google Healthcare Natural Language API [133], or Amazon Comprehend Medical [134]).

An Overview of Clinical NLP Toolkits and Packages

NLP packages and toolkits are useful resources for developing clinical NLP solutions, especially for text-preprocessing and machine learning approaches. Well-known toolkits include WEKA [135], MALLET [136], OpenNLP [137], SPLAT [138], NLTK [139], and SpaCy [140]. Recently, there has been a rapid growth in the number of open-source deep learning packages (frameworks). Common examples of these packages are Torch [141], Theano [142], MxNet [143], TensorFlow [144], PyTorch [145], Keras [146], and CNTK [147]. Although studies have found variations in the GPU performance and memory management among these libraries [148, 149], most of the packages share similar core competencies, and the selection of appropriate packages can be based on the research environment and user preference.

Challenges, Opportunities, and Future Directions

Despite the notable benefits of leveraging NLP to facilitate clinical research, several open challenges remain. In this section, we discuss three challenges that need further investigation: reproducibility and scientific rigor, multisite NLP collaboration, and federated learning and evaluation.

Reproducibility and Scientific Rigor

Considering that many NLP solutions could serve as middleware applications (i.e., supplying research data) for clinical research, the validity of research outcomes for such studies depends on the robustness and trustworthiness of the NLP models used as well as the quality of the data being fed into these models [149,150,152]. Existing clinical NLP applications face various data quality issues caused by the heterogeneity of the EHR environment. Since EHR systems are primarily designed for patient care and billing, routinely generated and documented clinical information may suffer from data quality issues when used for clinical research. Furthermore, the EHR system itself may have a strong impact on the syntax and semantics of patient narratives due to its built-in documentation functionality such as smart forms, templates, and macros. Therefore, it is important to have a good understanding of the EHR data before model development and deployment. In addition to data quality, reproducibility, which measures the ability to obtain the same (or a sufficiently similar) result by following the same (or sufficiently detailed) computational steps, is another important criterion for trusted NLP solutions. In the context of clinical NLP, this criterion emphasizes the need for provenance of information resources (e.g., corpus, system, and associated research metadata such as the inclusion and exclusion criteria used) and for process transparency to ensure scientific rigor. Another quality dimension that is commonly cited as a factor in “user trust” and safety is interpretability [153]. In clinical research, the explanations of NLP results may serve as important criteria for evaluating a model’s capability to explain why a certain decision was made.

Multisite NLP Collaboration

Compared with manual chart review, NLP solutions are distinctive in their ability to systematically extract clinical concepts from clinical text, offering high-throughput solutions for automated data abstraction across multiple institutions. Therefore, NLP has strong potential to facilitate multisite clinical research collaborations and nationwide research registry development. However, successfully deploying an existing NLP solution to a different EHR environment is nontrivial. We highlight three important NLP dimensions to be considered: implementability, portability, and customizability. Implementability evaluates the feasibility of deploying NLP solutions to the clinical environment. The NLP implementation process is highly dependent on institutional infrastructure, system requirements, data usage agreements, and research and practice objectives. In addition, how NLP models are packaged can affect the complexity of implementation. For example, an NLP solution that can be packaged into a standalone tool demands a different implementation process from one that needs to be integrated into existing infrastructure [100]. After deployment, the performance of the NLP solution needs to be re-evaluated in each local environment. Many studies have found that NLP algorithms developed in one institution for a study may not perform well when reused in the same institution or deployed to a different institution or for different studies [154]. The degradation of NLP performance at a different site is often referred to as an NLP portability issue. Differences in EHR systems, care practice, and data documentation standards across institutions may contribute to variability in clinical documentation and suboptimal performance of NLP systems. To address this, a local evaluation and refinement process can potentially improve the system. The feasibility of system refinement depends on the customizability of each system, which measures how easily a model can be adapted, modified, and refined from its existing implementation when a concept definition changes or clinical guidelines are updated. This quality dimension can affect the choice between different NLP approaches (e.g., symbolic vs. machine learning) for multisite studies.
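The local re-evaluation step described above can be sketched in a few lines of Python. The example below assumes that each site holds a set of gold-standard annotations and a set of NLP predictions, each represented as (document, concept) pairs, and that performance is summarized with precision, recall, and F1. The choice of metrics, the 0.05 tolerance, and the function names are assumptions made for illustration rather than recommendations from the studies cited.

```python
from __future__ import annotations


def precision_recall_f1(gold: set, predicted: set) -> tuple[float, float, float]:
    """Compare predicted annotations against a local gold standard."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def flag_portability_drop(
    site_f1: dict[str, float], development_site: str, tolerance: float = 0.05
) -> list[str]:
    """Return sites whose F1 falls below the development site's F1 by more
    than `tolerance` (a hypothetical threshold; choose one per study)."""
    baseline = site_f1[development_site]
    return sorted(site for site, f1 in site_f1.items() if baseline - f1 > tolerance)


if __name__ == "__main__":
    # Toy annotations as (document id, concept) pairs; the values are made up.
    gold = {("doc1", "fall"), ("doc2", "delirium"), ("doc3", "fall"), ("doc4", "fall")}
    predicted = {("doc1", "fall"), ("doc2", "delirium"), ("doc9", "fall")}
    print(precision_recall_f1(gold, predicted))
    print(flag_portability_drop({"site_a": 0.92, "site_b": 0.78}, "site_a"))
```

A site flagged in this way is a candidate for the local refinement process, and how costly that refinement is depends on the customizability of the underlying system.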

Federated Learning and Evaluation

Another barrier to developing robust and portable NLP solutions is the lack of multisite data, which stems from the regulations, privacy, and security requirements surrounding protected health information (PHI) and from the high cost of creating well-annotated and curated clinical corpora [34, 155]. Federated learning, a machine learning approach for training statistical models on remote devices, can potentially be leveraged to address data sharing challenges [156, 157]. Individual sites collaboratively train a model and send incremental updates for immediate aggregation, achieving the shared learning objectives without the need to distribute data [156, 157]. Traditional federated learning is, however, limited to machine learning approaches. To further enhance process transparency and model interpretability, the OHNLP Consortium [158] adapted the federated learning approach and proposed a collaborative NLP development framework [159]. The framework contains a user-centric crowdsourcing interface for collaborative ruleset development and a transparent multisite participation workflow for corpus development and evaluation [159]. Site-specific knowledge and findings can therefore be effectively aggregated and synthesized. A similar concept is federated evaluation, a process of deploying NLP solutions to local institutions, running the models on local data, and sharing performance results with a centralized location (e.g., a cloud server). For example, the NLP Sandbox, developed by the National Center for Data to Health (CD2H), is a federated evaluation platform that enables continuous benchmarking of NLP models on data hosted at different sites through Docker containers. Through this approach, institution-specific findings and knowledge can be learned and shared without transferring PHI.
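The aggregation step in federated learning can be illustrated with a small Python sketch. The example below uses a federated-averaging style scheme, in which each site refines the current global weights on its own data and a coordinator averages the returned weights in proportion to each site's sample count. The linear model, the learning rate, the number of rounds, and all function names are assumptions made to keep the example self-contained; they are not taken from the frameworks cited above.

```python
from __future__ import annotations


def local_update(weights, features, labels, lr=0.1, epochs=5):
    """Run a few gradient-descent steps on one site's data
    (linear model, mean squared error). Raw data never leaves this function."""
    w = list(weights)
    n = len(features)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(features, labels):
            error = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2 * error * xj / n
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w


def federated_average(site_weights, site_sizes):
    """Average site models, weighting each by its number of local samples."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(dim)
    ]


def train_federated(sites, dim, rounds=50):
    """Coordinator loop: only model weights are exchanged between rounds."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(global_w, X, y) for X, y in sites]
        global_w = federated_average(updates, [len(y) for _, y in sites])
    return global_w


if __name__ == "__main__":
    # Two toy sites whose data follow y = 2*x + 1 (second feature is a bias term).
    site_a = ([[1.0, 1.0], [2.0, 1.0]], [3.0, 5.0])
    site_b = ([[0.0, 1.0], [3.0, 1.0], [4.0, 1.0]], [1.0, 7.0, 9.0])
    print(train_federated([site_a, site_b], dim=2))
```

The same pattern of exchanging aggregates rather than records underlies federated evaluation, where sites would return performance metrics instead of model weights.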


In conclusion, this chapter provided an overview of clinical NLP in the context of the secondary use of EHR for clinical research. A case study of aging was conducted to demonstrate an end-to-end process of NLP development and evaluation. We further discussed three open challenges and highlighted the importance of translational science and community engagement efforts for leveraging clinical NLP applications to support research.