Introduction

Developments in machine learning have enabled rapid advancement in natural language processing (NLP) of text-based clinical notes in electronic health records (EHR) and radiology reports [1, 2]. Many NLP applications are trained on large, publicly available repositories of clinical notes such as i2b2/n2c2 and MIMIC; such repositories can be essential to the development of new NLP tools to improve patient care [3]. Although numerous repositories of radiological images are publicly available, there are few public repositories of radiology reports [4]. Such reports contain a wealth of information – such as imaging findings, diagnoses, conclusions, and recommendations – that could be leveraged by new NLP tools. Already, NLP tools trained using radiology reports can aid in diagnostic surveillance, cohort building, query-based case retrieval, and quality assessment [5].

Laws such as the U.S. Health Insurance Portability and Accountability Act (HIPAA) [6] and the European Union’s General Data Protection Regulation (GDPR) [7] impose penalties for the release of protected health information. The need to de-identify radiology reports before they can be released publicly is therefore a major obstacle to the construction of a public multi-site repository [8].

De-identification of clinical text – the identification and removal of protected health information (PHI) such as names, dates, patient numbers, and other identifiers – remains a difficult task, in part because of variation in structure, format, language, and the distribution of PHI across documents. Although some de-identification tools are publicly available, most have been trained on and developed for clinical notes [9,10,11]. Recent studies have shown that domain adaptation is required to adequately de-identify texts from other domains, and that these existing tools exhibit reduced performance when applied to radiology reports [12].

Ensemble methods combine the predictions of multiple decision models, using a wide range of voting and vote-weighting schemes to achieve greater performance in a variety of contexts [13, 14]. In this work, we developed and evaluated ensemble methods for de-identification of PHI in radiology reports that combined the predictions of three publicly available de-identification tools originally developed for clinical notes. We tested the hypothesis that the performance of an ensemble method could surpass that of the individual de-identification tools.

Materials and Methods

Dataset

The study was approved by the organization’s Institutional Review Board. The study data were collected during routine clinical care. A random sample of 2,503 radiology reports dated January 1, 2012 through January 8, 2019 was assembled retrospectively from a large, multi-hospital U.S. medical system comprising academic and community radiology practices with urban, suburban, and rural practice sites. Reports were created by more than 300 attending radiologists and radiology trainees and spanned a variety of imaging modalities. We evaluated only final reports and did not distinguish between reports with and without trainee involvement. Reports were partitioned into disjoint training (n = 1,480) and testing (n = 1,023) sets.

Annotation

PHI was defined to match the Safe Harbor criteria of the HIPAA Privacy Rule [8], with the addition of three categories: names of healthcare workers, names of hospitals, and names of software/tools. Each report was labeled independently by two annotators to ensure inter-annotator reliability and improve label accuracy; disagreements were resolved by consensus. The frequency of PHI in the testing set is shown in Table 1.

Table 1 Distribution of PHI in the 1,023 radiology reports used for evaluation. The Safe Harbor method from the HIPAA Privacy Rule was used to define standardized PHI, with the addition of three categories: names of healthcare workers, names of hospitals, and names of software/tools

De-Identification Software

We incorporated three publicly available de-identification tools, all developed originally for clinical notes. MIT deid and Philter use a variety of dictionaries, rules, and regular expressions to identify PHI and do not incorporate machine learning [9, 11]. NeuroNER is a machine learning model that uses recurrent neural networks to identify various forms of PHI [10]; in this work, we used the NeuroNER model pre-trained on radiology reports, as described by Steinkamp et al. [14].

Ensemble Methods

For each token, the outputs of the three tools were encoded as a binary feature array: each entry was “1” if the corresponding tool had identified the token as PHI and “0” otherwise (Fig. 1). We defined three simple voting approaches (1-Vote, 2-Votes, and 3-Votes) that labeled a token as positive if at least one, at least two, or all three of the underlying tools labeled it positive, respectively. The simple voting approaches were applied to all tokens in the 1,023 reports in the test set.
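As an illustration, the following is a minimal Python sketch of these voting rules, assuming the three tools’ token-level outputs have already been collected into a binary array; the function and variable names are illustrative and not drawn from the study’s code.

import numpy as np

def vote_ensemble(token_votes, threshold):
    # Label a token as PHI when at least `threshold` of the three tools flagged it.
    return (token_votes.sum(axis=1) >= threshold).astype(int)

# One row per token; columns are the binary outputs of the three tools.
token_votes = np.array([[1, 0, 0],   # flagged by one tool
                        [1, 1, 0],   # flagged by two tools
                        [1, 1, 1]])  # flagged by all three tools
print(vote_ensemble(token_votes, 1))  # 1-Vote:  [1 1 1]
print(vote_ensemble(token_votes, 2))  # 2-Votes: [0 1 1]
print(vote_ensemble(token_votes, 3))  # 3-Votes: [0 0 1]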

Fig. 1

Workflow for ensemble de-identification methods. PHI was detected at the token level, and the distribution of PHI in the dataset at both the report and token level can be found in Table 1

The three binary features were then used as inputs to three traditional machine learning classifiers (Decision Tree, Bayesian, and Boosting), trained and evaluated with a 60/40 train/test split of the testing partition. In a Decision Tree classifier, the features are passed through a tree structure in which each internal node tests a specific attribute and each leaf node assigns a class [14]. In a Naïve Bayes (Bayesian) classifier, the prior distribution of the classes and the class-conditional distribution of the inputs are estimated, and Bayes’ rule is applied to obtain the posterior probability that a sample belongs to a class given its inputs [14]. Here we used a Gaussian Naïve Bayes classifier, in which the class-conditional likelihood of each input is assumed to follow a Gaussian (normal) distribution. In an AdaBoost (Boosting) classifier, multiple weak classifiers are trained successively, with increasing weight placed on misclassified observations, and then combined into an ensemble [15]. Precision, recall, and F1 score were computed for the voting algorithms and the machine learning ensembles.
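A minimal scikit-learn sketch of this training procedure follows; the synthetic stand-in data, random seeds, and default hyperparameters are assumptions for illustration and do not reproduce the study’s actual configuration.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the real data: X holds the three tools' binary votes per
# token; y holds gold-standard PHI labels (here synthesized for illustration).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 3))
y = (X.sum(axis=1) >= 2).astype(int)

# 60/40 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)

for name, clf in [("Decision Tree", DecisionTreeClassifier()),
                  ("Bayesian", GaussianNB()),
                  ("Boosting", AdaBoostClassifier())]:
    clf.fit(X_train, y_train)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, clf.predict(X_test), average="binary")
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")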

Results

The performance of MIT deid, NeuroNER, and Philter on their original datasets and on the testing set of radiology reports is shown in Table 2. Performance metrics were precision (positive predictive value: the fraction of tokens flagged as PHI that were truly PHI), recall (sensitivity: the fraction of true PHI tokens that were flagged), and F1 score (the harmonic mean of precision and recall). All three tools demonstrated lower recall on radiology reports than their reported performance on clinical notes.
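In terms of token-level counts of true positives (TP), false positives (FP), and false negatives (FN), these metrics follow the standard definitions:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)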

Table 2 Performance of the three publicly available de-identification tools [9,10,11] on their original datasets and on the testing set of 1,023 radiology reports, with 95% confidence intervals

The performance metrics for the ensemble classifiers are shown in Table 3. The simple voting classifiers were evaluated on all 1,023 testing-partition reports; the three machine learning ensembles were evaluated on the 40% test split of the testing partition. Not surprisingly, the 3-Votes ensemble achieved greater precision, and the 1-Vote algorithm greater recall, than any of the three individual tools. The Decision Tree and Boosting classifiers also demonstrated greater precision than the three individual tools, while the Bayesian classifier showed greater recall.

Table 3 Performance of the ensemble models with 95% confidence intervals

Discussion

Summary

The three individual publicly available de-identification tools – MIT deid, NeuroNER, and Philter – all demonstrated lower recall on radiology reports than on clinical notes, likely because radiology reports differ from clinical notes in both structure and language. The distribution of PHI in radiology reports can differ substantially from that in the general-purpose clinical notes used to train many text de-identification tools [14]. For example, in our dataset, dates were the most common form of PHI (Table 1); dates vary widely in format, making them difficult to capture with purely rule-based de-identification methods. The performance of existing tools developed for clinical notes is inadequate on radiology reports for clinical and research use, indicating a need for de-identification methods tailored specifically to radiology reports.

No ensemble method outperformed NeuroNER in both precision and recall, but both the Bayesian classifier and the 1-Vote voting algorithm exceeded NeuroNER’s recall, surpassing 95%. This represents a substantial improvement: de-identification tasks tend to prioritize recall over precision because each false negative requires significant manual review to identify. An ensemble method effectively pools the predictions of multiple tools, so that a PHI token missed by one tool can still be detected by another, enabling superior recall.

Limitations

Although these methods are promising, they were evaluated on a limited dataset from a single health system with a skewed distribution of PHI.

Future Work

Future work will include assessing the performance of ensemble methods on a larger multicenter dataset that incorporates more variation in report format. Data augmentation will be applied to better quantify tool performance on types of PHI that are under-represented in this dataset, such as the names of patients, family members, and healthcare workers. We will also examine whether adding more publicly available de-identification tools to the ensemble can improve recall.

Conclusions

In this work, we developed ensemble methods for de-identification of PHI in radiology reports that incorporate publicly available de-identification tools developed for clinical notes. We showed that the Bayesian classifier and the 1-Vote voting algorithm outperformed the best individual de-identification tool (NeuroNER) in recall, indicating that these ensemble methods hold substantial promise for larger-scale use in building a publicly available corpus of de-identified radiology reports. Future work includes evaluation on a larger multicenter dataset augmented with under-represented forms of PHI, as well as incorporation of additional de-identification tools into the ensembles.