Introduction

Developments in machine learning have enabled rapid advancement in natural language processing (NLP) of text-based clinical notes in electronic health records (EHR) and radiology reports [1, 2]. Many NLP applications are trained on large, publicly available repositories of clinical notes such as i2b2/n2c2 and MIMIC; such repositories can be essential to the development of new NLP tools to improve patient care [3]. Although numerous repositories of radiological images are publicly available, there are few public repositories of radiology reports [4]. Such reports contain a wealth of information – such as imaging findings, diagnoses, conclusions, and recommendations – that could be leveraged by new NLP tools. Already, NLP tools trained using radiology reports can aid in diagnostic surveillance, cohort building, query-based case retrieval, and quality assessment [5].

Laws such as the U.S. Health Insurance Portability and Accountability Act (HIPAA) [6] and the European Union’s General Data Protection Regulation (GDPR) [7] impose penalties for the release of protected health information. The need to de-identify radiology reports before they can be released publicly is therefore a major obstacle to the construction of a public multi-site repository [8].

De-identification of clinical text – the identification and removal of protected health information (PHI) such as names, dates, patient numbers, and other identifiers – remains a difficult task, in part because of variation in structure, format, language, and the distribution of PHI across documents. Although some de-identification tools are publicly available, most have been trained on and developed for clinical notes [9,10,11]. Recent studies have shown that domain adaptation is required to adequately de-identify texts from other domains, and that these existing tools exhibit reduced performance when applied to radiology reports [12].

Ensemble methods combine the predictions of multiple decision models, using a wide range of voting and vote-weighting schemes to achieve greater performance in a variety of contexts [13, 14]. In this work, we developed and evaluated ensemble methods for de-identification of PHI in radiology reports that combined the predictions of three publicly available de-identification tools originally developed for clinical notes. We tested the hypothesis that the performance of an ensemble method could surpass that of the individual de-identification tools.

Materials and Methods

Dataset

The study was approved by the organization’s Institutional Review Board. The study data were collected during routine clinical care. A random sample of 2,503 radiology reports dated January 1, 2012 through January 8, 2019 was assembled retrospectively from a large, multi-hospital U.S. medical system comprising academic and community radiology practices with urban, suburban, and rural practice sites. Reports were created by more than 300 attending radiologists and radiology trainees and spanned a variety of imaging modalities. We evaluated only final reports and did not distinguish between reports with and without trainee involvement. Reports were partitioned into disjoint training (n = 1,480) and testing (n = 1,023) sets.

Annotation

PHI was defined to match the Safe Harbor criteria of the HIPAA Privacy Rule [8], with the addition of three categories: names of healthcare workers, names of hospitals, and names of software/tools. Each report was labeled independently by two annotators to ensure inter-annotator reliability and improve label accuracy; disagreements were resolved by consensus. The frequency of PHI in the testing set is shown in Table 1.

Table 1 Distribution of PHI in the 1,023 radiology reports used for evaluation. The Safe Harbor method from the HIPAA Privacy Rule was used to define standardized PHI, with the addition of three categories: names of healthcare workers, names of hospitals, and names of software/tools

De-Identification Software

We incorporated three publicly available de-identification tools, all developed originally for clinical notes. MIT deid and Philter use a variety of dictionaries, rules, and regular expressions to identify PHI and do not incorporate machine learning [9, 11]. NeuroNER is a machine learning model that uses recurrent neural networks to identify various forms of PHI [10]; in this work, we used the NeuroNER model pre-trained on radiology reports, as described by Steinkamp et al. [14].

Ensemble Methods

For each token, the outputs of the three tools were encoded as a binary feature array: each entry was “1” if the corresponding tool had identified the token as PHI and “0” otherwise (Fig. 1). We defined three simple voting approaches (1-Vote, 2-Votes, and 3-Votes) that labeled a token as positive if at least one, at least two, or all three of the underlying tools labeled it positive, respectively. The simple voting approaches were applied to all tokens in the 1,023 reports in the test set.
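As an illustration, the following is a minimal Python sketch of these voting rules, assuming the three tools’ token-level outputs have already been collected into a binary array; the function and variable names are illustrative and not drawn from the study’s code.

import numpy as np

def vote_ensemble(token_votes, threshold):
    # Label a token as PHI when at least `threshold` of the three tools flagged it.
    return (token_votes.sum(axis=1) >= threshold).astype(int)

# One row per token; columns are the binary outputs of the three tools.
token_votes = np.array([[1, 0, 0],   # flagged by one tool
                        [1, 1, 0],   # flagged by two tools
                        [1, 1, 1]])  # flagged by all three tools
print(vote_ensemble(token_votes, 1))  # 1-Vote:  [1 1 1]
print(vote_ensemble(token_votes, 2))  # 2-Votes: [0 1 1]
print(vote_ensemble(token_votes, 3))  # 3-Votes: [0 0 1]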

Fig. 1

Workflow for ensemble de-identification methods. PHI was detected at the token level, and the distribution of PHI in the dataset at both the report and token level can be found in Table 1

The three binary features were then used as inputs to three traditional machine learning classifiers (Decision Tree, Bayesian, and Boosting), trained and evaluated with a 60/40 train/test split of the testing partition. In a Decision Tree classifier, the features are passed through a tree structure in which each internal node tests a specific attribute and each leaf node assigns a class [14]. In a Naïve Bayes (Bayesian) classifier, the prior distribution of the classes and the class-conditional distribution of the inputs are estimated, and Bayes’ rule is applied to obtain the posterior probability that a sample belongs to a class given its inputs [14]. Here we used a Gaussian Naïve Bayes classifier, in which the class-conditional likelihood of each input is assumed to follow a Gaussian (normal) distribution. In an AdaBoost (Boosting) classifier, multiple weak classifiers are trained successively, with increasing weight placed on misclassified observations, and then combined into an ensemble [15]. Precision, recall, and F1 score were computed for the voting algorithms and the machine learning ensembles.
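A minimal scikit-learn sketch of this training procedure follows; the synthetic stand-in data, random seeds, and default hyperparameters are assumptions for illustration and do not reproduce the study’s actual configuration.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the real data: X holds the three tools' binary votes per
# token; y holds gold-standard PHI labels (here synthesized for illustration).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 3))
y = (X.sum(axis=1) >= 2).astype(int)

# 60/40 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)

for name, clf in [("Decision Tree", DecisionTreeClassifier()),
                  ("Bayesian", GaussianNB()),
                  ("Boosting", AdaBoostClassifier())]:
    clf.fit(X_train, y_train)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, clf.predict(X_test), average="binary")
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")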

Results

The performance of MIT deid, NeuroNER, and Philter on their original datasets and on the testing set of radiology reports is shown in Table 2. Performance metrics were precision (positive predictive value: the fraction of tokens flagged as PHI that were truly PHI), recall (sensitivity: the fraction of true PHI tokens that were flagged), and F1 score (the harmonic mean of precision and recall). All three tools demonstrated lower recall on radiology reports than their reported performance on clinical notes.
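In terms of token-level counts of true positives (TP), false positives (FP), and false negatives (FN), these metrics follow the standard definitions:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)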

Table 2 Performance of the three publicly available de-identification tools [9,10,11] on their original datasets and on the testing set of 1,023 radiology reports, with 95% confidence intervals

The performance metrics for the ensemble classifiers are shown in Table 3. The simple voting classifiers were evaluated on all 1,023 testing-partition reports; the three machine learning ensembles were evaluated on the 40% test split of the testing partition. Not surprisingly, the 3-Votes ensemble achieved greater precision, and the 1-Vote algorithm greater recall, than any of the three individual tools. The Decision Tree and Boosting classifiers also demonstrated greater precision than the three individual tools, while the Bayesian classifier showed greater recall.

Table 3 Performance of the ensemble models with 95% confidence intervals

Discussion

Summary

The three individual publicly available de-identification tools – MIT deid, NeuroNER, and Philter – all demonstrated lower recall on radiology reports than on clinical notes, likely because radiology reports differ from clinical notes in both structure and language. The distribution of PHI in radiology reports can differ substantially from that in the general-purpose clinical notes used to train many text de-identification tools [14]. For example, in our dataset, dates were the most common form of PHI (Table 1); dates vary widely in format, making them difficult to capture with purely rule-based de-identification methods. The performance of existing tools developed for clinical notes is inadequate on radiology reports for clinical and research use, indicating a need for de-identification methods tailored specifically to radiology reports.

No ensemble method outperformed NeuroNER in both precision and recall, but both the Bayesian classifier and the 1-Vote voting algorithm exceeded NeuroNER’s recall, surpassing 95%. This represents a substantial improvement: de-identification tasks tend to prioritize recall over precision because each false negative requires significant manual review to identify. An ensemble method effectively pools the predictions of multiple tools, so that a PHI token missed by one tool can still be detected by another, enabling superior recall.

Limitations

Although these methods are promising, they were evaluated on a limited dataset from a single health system with a skewed distribution of PHI.

Future Work

Future work will include assessing the performance of ensemble methods on a larger multicenter dataset that incorporates more variation in report format. Data augmentation will be applied to better quantify tool performance on types of PHI that are under-represented in this dataset, such as the names of patients, family members, and healthcare workers. We will also examine whether adding more publicly available de-identification tools to the ensemble can improve recall.

Conclusions

In this work, we developed ensemble methods for de-identification of PHI in radiology reports that incorporate publicly available de-identification tools developed for clinical notes. We showed that the Bayesian classifier and the 1-Vote voting algorithm outperformed the best individual de-identification tool (NeuroNER) in recall, indicating that these ensemble methods hold substantial promise for larger-scale use in building a publicly available corpus of de-identified radiology reports. Future work includes evaluation on a larger multicenter dataset augmented with under-represented forms of PHI, as well as incorporation of additional de-identification tools into the ensembles.