Introduction

Myocardial perfusion imaging (MPI) is the most common noninvasive cardiac test to evaluate emergency department (ED) patients with suspected acute coronary syndrome (ACS).1 Prior studies have demonstrated a strong association of abnormal MPI studies with adverse cardiovascular events during follow-up.2,3 There is still much to learn regarding the effectiveness of MPI and other noninvasive cardiac tests in relation to patient outcomes, care affordability, and the patients most likely to benefit.4,5 Comparative effectiveness studies to assess the value of MPI or other noninvasive tests in acute care settings pose many challenges, including the high costs of large randomized trials and the confounding factors associated with non-randomized study designs.4,6 Efficiently capturing the results of large numbers of MPI studies would provide the information necessary to conduct large-scale observational studies that answer important questions about clinical effectiveness, risks, and benefits to patients.

MPI reports document crucial details on MPI testing that are essential to downstream care. Such text-formatted reports are written in human language, which is difficult for computers to process. Natural language processing (NLP) is a subfield of artificial intelligence and computer science focused on the interactions between computers and natural (human) languages. As electronic health records (EHR) have become more accessible, NLP has seen increased use in the clinical field. For clinical research, NLP enables computers to identify and extract information that is unavailable or inaccurate in structured data.7,8 When compared with manual chart review of medical records, NLP is more efficient and produces more consistent results.9

We previously developed NLP algorithms for the extraction of cardiovascular variables, such as ejection fraction and aspirin and warfarin use.10,11,12 Recently, we demonstrated NLP’s ability to identify clinical variables from electrocardiogram treadmill test (ETT) reports.13

In this study, we aimed to derive and validate an algorithm to identify and extract MPI results from MPI reports. We applied the NLP algorithm to a large MPI cohort and examined whether the NLP-classified MPI results were associated with the risk of cardiac events. Our study builds on previous research13,14 and leverages a unique dataset of a substantial patient cohort with MPI testing.

Methods

Study Setting

We performed this retrospective cohort study at Kaiser Permanente Southern California (KPSC), an integrated healthcare organization with over 7,600 physicians, 15 hospitals, 234 medical offices, and approximately 1 million annual ED visits. KPSC provides prepaid health care to over 4.7 million racially and socio-economically diverse members in KPSC-owned facilities and contracting facilities. In 2007, KPSC implemented an EHR system based on an Epic Systems platform. All KPSC ED sites use the same troponin lab assay (Beckman Coulter Access AccuTnI+3). ED physicians at KPSC can order noninvasive cardiac testing as part of the discharge and follow-up plan of patients with suspected ACS. In May 2016, KPSC implemented the HEART (History, Electrocardiogram, Age, Risk factors, Troponin) score into routine ED care allowing for a standardized risk assessment for patients with suspected ACS.15 The KPSC Institutional Review Board approved this study.

Study Population

We included all KPSC members aged 18 years or older with an ED visit with clinically suspected ACS resulting in a troponin lab order between 01/01/2015 and 11/30/2018, who underwent an MPI within 30 days of their visit. We excluded patients who were transferred from a non-KPSC hospital or passed away during the ED visit. We also excluded patients without KPSC health plan membership because our dataset does not accurately capture comorbidities and patient outcomes for non-members. MPI studies were identified using Current Procedural Terminology (CPT®) codes (78451-78452) or a referral order linked to the index ED visit.

We obtained demographic information such as age, sex, and race from administrative records; smoking and family history of coronary artery disease (CAD) from self-reported fields in the EHR; and medications from our prescription and pharmacy systems. Body mass index (BMI) was obtained from ED intake documentation or the most recently available visit. Troponin values were extracted from the lab data. HEART scores calculated at the time of the index ED visit were retrieved from the EHR. Comorbidities were defined using the International Classification of Diseases Ninth/Tenth Revision, Clinical Modification (ICD-9/10-CM) codes included in the Elixhauser score.

MPI Reports

KPSC does not have structured reporting for MPI exams. The MPI reports were dictated or written by the interpreting physicians in an unstructured, free-text format. The reports were saved to the Epic Clarity system running on Oracle Exadata.

Training and Validation Datasets

The required size of the validation dataset was 147 reports,16 assuming a prevalence of non-normal MPI findings of 13%,17,18,19 an expected maximum marginal error of 0.1, and NLP sensitivity and specificity of 95% compared with a reference standard.13 We created training (n = 120) and validation (n = 150) datasets by random sampling from the study population. Two cardiologists (M.F. and M.S.L.) independently reviewed the MPI reports in the training and validation datasets. The cardiologists were blinded to each other’s reviews and abstracted solely based on the reports. The results of the physician reviews were compared, and discrepancies were resolved by consensus and discussion with another physician on the research team (R.F.R.). The adjudicated results served as the reference standard against which NLP was compared. We assessed the agreement between the two physician reviewers using the weighted Cohen’s κ20 and the intraclass correlation coefficient (ICC).21
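As an illustration of the agreement statistic named above, the following is a minimal sketch of a weighted Cohen's κ for two raters. It assumes linear disagreement weights and an ordered category list; the weighting scheme and the function name `weighted_kappa` are assumptions made for this sketch, not details reported in the study.

```python
def weighted_kappa(rater_a, rater_b, categories):
    """Linear-weighted Cohen's kappa for two raters (illustrative sketch).

    `categories` must list every possible rating in order; at least two
    categories are required. Linear weights are an assumption.
    """
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must supply the same non-zero number of ratings")

    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Joint proportion table of the two raters' ratings.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Disagreement weight grows linearly with the distance between categories.
    def weight(i, j):
        return abs(i - j) / (k - 1)

    observed_dis = sum(weight(i, j) * observed[i][j]
                       for i in range(k) for j in range(k))
    expected_dis = sum(weight(i, j) * row[i] * col[j]
                       for i in range(k) for j in range(k))
    if expected_dis == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed_dis / expected_dis
```

With identical ratings the observed disagreement is zero and the function returns 1; values near zero indicate agreement no better than chance.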

NLP Algorithm Development

We developed an NLP-based algorithm to extract information from the MPI reports. The basic NLP processes were described previously.9,10 First, we converted the clinical notes extracted from the EHR system into formats suitable for the NLP search. A pre-processing step removed ill-formatted text and detected section and sentence boundaries. We created terminologies for MPI-related information. Each report was searched at different scales: section, sentence, and neighboring sentences. A relationship detection algorithm was applied to identify the associated clinical entities. Negation and temporal relationship algorithms were used to identify and exclude negated, uncertain, historical, and future statements. The negation algorithm handles double negations, which commonly occur in MPI reports, e.g., “no significant abnormality.” Regular expressions were used to capture semi-structured information, e.g., left ventricular ejection fraction (EF) values. We extracted information that was commonly available in MPI reports (Figure 1). We derived the final set of variables based on the clinical logic described below. The main aim of our study was to identify patients with evidence raising concern for ACS. Therefore, we categorized the MPI results as follows:

Figure 1. Diagram illustrating the NLP process applied to MPI reports. NLP extracted commonly available information from the MPI reports. The extracted information was used to derive the final set of variables based on the clinical logic. MPI, myocardial perfusion imaging; NLP, natural language processing; EF, ejection fraction

Ischemia: an ischemic or reversible defect was identified.

Infarction: no definitive ischemic finding, but a fixed or irreversible defect was identified.

Non-diagnostic: ischemia or infarction cannot be ruled out due to the presence of artifacts or sub-optimal test quality.

Normal: test quality was sufficient to rule out ischemia or infarction.

For ischemic cases, we further identified ischemic location, size, and severity. When the defect size was not stated, we estimated it from the number of left ventricular segments involved. We used the 17-segment model to define the defect size as small (involving 1-2 segments), medium (3-4 segments), and large (≥ 5 segments).22 We dichotomized the defect size results into “Small_medium” and “Large,” and the defect severity into “Mild_moderate” and “Severe.” The EF result was categorized into abnormal (≤ 40%), borderline (41%-49%), and normal (≥ 50%).
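The thresholds above can be sketched in Python as follows. This is a minimal illustration only: the regular expression `EF_PATTERN` and all function names are hypothetical, the pattern handles only integer percentages in a few simple phrasings, and the extraction in the study itself was performed with different tooling.

```python
import re

# Hypothetical pattern: matches phrases such as "LVEF 55%", "EF: 45 %",
# or "ejection fraction is 38%". Real reports vary far more than this.
EF_PATTERN = re.compile(
    r"\b(?:LV\s*EF|EF|ejection\s+fraction)\b[^0-9%]{0,30}?(\d{1,3})\s*%",
    re.IGNORECASE,
)


def extract_ef(text):
    """Return the first integer EF percentage found in the text, or None."""
    match = EF_PATTERN.search(text)
    if match is None:
        return None
    value = int(match.group(1))
    return value if 0 < value <= 100 else None


def categorize_ef(ef_percent):
    """Apply the stated thresholds: <=40 abnormal, 41-49 borderline, >=50 normal."""
    if ef_percent <= 40:
        return "abnormal"
    if ef_percent < 50:
        return "borderline"
    return "normal"


def defect_size_from_segments(n_segments):
    """Map the number of involved segments (17-segment model) to a size label."""
    if n_segments < 1:
        raise ValueError("a defect must involve at least one segment")
    if n_segments <= 2:
        return "small"
    if n_segments <= 4:
        return "medium"
    return "large"


def dichotomize_size(size):
    """Collapse small/medium/large into the two analysis categories."""
    return "Large" if size == "large" else "Small_medium"
```

For example, `categorize_ef(extract_ef("The calculated LVEF is 38%."))` yields the abnormal category, and a defect spanning three segments maps to medium and then to “Small_medium.”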

MPI reports can include equivocal findings. For instance, “There is a small sized mild severity, fixed defect in the inferior wall likely due to soft tissue attenuation artifact, although scar cannot be entirely excluded.” Therefore, we built rules to provide a consistent summary interpretation. For example, we used the wall motion and EF values to differentiate true perfusion defects from artifacts.23 If there was no wall motion or EF abnormality, we considered the defect to be an artifact. Since both resting and stress tests are needed to differentiate acute ischemia from old infarction, we excluded MPI tests without both resting and stress test results. The NLP algorithm was developed and iteratively improved using the training dataset. We used the programming language Python to pre-process the MPI reports. For terminology development, we used word embedding techniques, which capture the underlying and contextual representation of words and phrases. To extract information from the MPI reports, we used Linguamatics I2E. We built a post-processing step in Python to integrate and finalize the results based on the extracted information.
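A post-processing rule set of this kind might look like the sketch below. It is not the study's implementation: the field names, the order of precedence among the rules, the restriction of the artifact rule to fixed defects, and the choice to treat an artifact-attributed defect as not abnormal are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExtractedFindings:
    # All field names are hypothetical stand-ins for NLP-extracted flags.
    reversible_defect: bool = False
    fixed_defect: bool = False
    wall_motion_abnormal: bool = False
    ef_abnormal: bool = False
    quality_limited: bool = False  # report says the study could not rule out disease


def summarize_mpi(findings):
    """Return one of: Ischemia, Infarction, Non-diagnostic, Normal."""
    # Assumed precedence: a reversible defect is checked first.
    if findings.reversible_defect:
        return "Ischemia"
    # Rule from the text: a defect without wall motion or EF abnormality is
    # considered an artifact, so a fixed defect counts as infarction only
    # when one of those abnormalities is present (assumed to apply here).
    if findings.fixed_defect and (findings.wall_motion_abnormal or findings.ef_abnormal):
        return "Infarction"
    if findings.quality_limited:
        return "Non-diagnostic"
    # Assumption: remaining reports, including artifact-attributed defects
    # in studies of adequate quality, are summarized as normal.
    return "Normal"
```

Under these assumptions, the equivocal fixed inferior-wall defect quoted above would be summarized as normal when wall motion and EF are preserved, and as infarction otherwise.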

Criterion Validity of NLP Algorithm

We evaluated the performance of NLP against the reference standard created by double-blinded review and consensus among cardiologist reviewers. We compared the agreement between the NLP results and the reference standard using weighted Cohen’s κ and the ICC. For the multi-class MPI result, we dichotomized it by each class in order to calculate the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for each class. Then, for each individual class, we calculated its sensitivity, specificity, and positive/negative predictive value (PPV/NPV). We calculated the overall performance metrics from the summed counts of TP, TN, FP, and FN across classes as micro-averaged scores for the MPI result. Micro-averaged scores are the preferred performance metrics for multi-class classification with imbalanced data.24
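The one-vs-rest dichotomization and micro-averaging described above can be sketched as follows. The function names and the toy labels in the usage note are illustrative assumptions, not study data.

```python
def one_vs_rest_counts(reference, predicted, label):
    """Count TP, FP, FN, TN for one class treated as positive vs all others."""
    tp = fp = fn = tn = 0
    for ref, pred in zip(reference, predicted):
        if pred == label and ref == label:
            tp += 1
        elif pred == label:
            fp += 1
        elif ref == label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from confusion counts."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def micro_average(reference, predicted, labels):
    """Sum the per-class counts over all classes, then compute the metrics once."""
    totals = [0, 0, 0, 0]
    for label in labels:
        counts = one_vs_rest_counts(reference, predicted, label)
        totals = [t + c for t, c in zip(totals, counts)]
    return metrics(*totals)
```

For instance, with the toy reference `["Normal", "Normal", "Ischemia", "Infarction"]` and predictions `["Normal", "Ischemia", "Ischemia", "Infarction"]` over the four result classes, one report is misclassified and the micro-averaged sensitivity is 3/4. When every report receives exactly one class, each misclassification adds one FP and one FN to the summed counts, so micro-averaged sensitivity equals micro-averaged PPV, and specificity equals NPV.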

Construct Validity of NLP Algorithm

We applied the NLP algorithm to the entire study cohort and compared the patient characteristics and comorbidities among the different MPI results. We treated the MPI result as a nominal variable rather than an ordinal variable. We included 30-day acute myocardial infarction (AMI) or all-cause mortality from the date of MPI as a descriptive patient outcome, as well as the 30-day major adverse cardiac event (MACE) rate, which was the composite of death, AMI, and any coronary revascularization procedure. We calculated P values using the χ2 or the Fisher exact test for all the categorical variables and the Wilcoxon test for all the continuous variables. We set the significance threshold at 0.05. We used SAS version 9.4 (SAS Institute, Cary, NC, USA) for data analysis.

Results

Study Population

Our study population included 16,957 patients with a mean age of 69 ± 12 years; 53% were women, and 60% were white (Table 1). Over 45% of the study population had a smoking history, 40% were obese, and 38% had a family history of CAD. The mean Elixhauser score was 5.4 ± 3.1. The mean ± standard deviation and median (interquartile range) days from ED visit to MPI test were 2.3 ± 5.6 and 0 (0, 1), respectively. One-third of the patients had a HEART score, and among them, 12.9% and 73.7% had low- and moderate-risk HEART scores, respectively. The mean troponin level was 0.1 ng/mL. The majority (97.2%) of these patients had a troponin level < 0.5 ng/mL (Supplemental Table S1). These 16,957 MPI reports were written by 111 interpreting physicians.

Table 1 Comparison of patient characteristics by NLP-identified myocardial perfusion imaging results

Criterion Validity of NLP Algorithm

The two cardiologists had excellent agreement on the majority of the variables, with ICC and κ over 90% (Supplemental Table S2). They disagreed more on ischemic severity, with 87.6% ICC and 87.3% κ. The agreement of NLP with the reference standard was similar to the agreement between the two cardiologist reviewers (Supplemental Table S2). NLP had a perfect match on ejection fraction, over 95% ICC and κ on MPI result, ischemia, and ischemic size, and over 90% ICC and κ on ischemic severity, infarction, and artifact.

Compared with the reference standard (n = 150), NLP achieved 96.7% sensitivity and PPV and 98.9% specificity and NPV on MPI results using micro-averaged evaluation metrics (Table 2). NLP achieved 100% sensitivity, 99.2% specificity, 96.9% PPV, and 100% NPV in identifying ischemia cases. NLP had lower sensitivity (50%) for non-diagnostic cases, partly due to the small number of such cases (n = 4). NLP had a lower PPV (89.3%) for identifying infarction.

Table 2 Comparison of NLP to the reference standard (n = 150) for identifying the MPI results

Construct Validity of NLP Algorithm

In the overall study population, the percentages of ischemia, infarction, non-diagnostic, and normal MPI results as identified by NLP were 16.1%, 12.2%, 1.5%, and 70.2%, respectively (Table 1). Compared with the patient group with normal MPI results, the groups with ischemia and infarction findings were more likely to be male, have a smoking history, and have cardiovascular-related comorbidities and medications. Patients with ischemia and non-diagnostic findings were more likely to be obese with BMI ≥ 35. Compared with the other groups, the non-diagnostic group had the highest mean and median days (3.2 and 1) from ED visit to MPI test. Over 68% of our sample had an undetectable troponin (< 0.02 ng/mL) at the ED encounter, as did approximately 50% of the patients whose MPI showed ischemia or infarction (Supplemental Table S1). Patients who underwent MPI had more cases of moderate (73.7%) and high (13.4%) HEART scores compared with our general ED patients25 (Supplemental Table S3). Among the ischemia cases, the majority had small- to medium-size defects and mild to moderate severities (Supplemental Table S4).

Overall 30-day event rates for the study cohort were 4.1% for death/AMI and 5.5% for MACE (Table 3). The 30-day death/AMI and MACE rates increased across MPI results, from normal (1.4% and 1.6%) to infarction (7.3% and 8.1%), non-diagnostic (10.7% and 14.1%), and ischemia (12.6% and 20.0%).

Table 3 30-day major adverse cardiac outcomes stratified by NLP-identified MPI results after an emergency department visit for a suspected acute coronary syndrome

Discussion

Artificial intelligence (AI), including machine learning (ML) and NLP, has been increasingly adopted within cardiology.26 In cardiovascular imaging, ML has been used to extract imaging variables from raw images and predict outcomes by combining with other clinical variables.27 NLP is another AI-based tool that can identify and extract variables from unstructured text data such as clinical notes and radiology reports. However, NLP is less discussed in cardiovascular imaging, especially in nuclear cardiovascular imaging.

In this study, we developed a computer-based method to identify and extract information from free-text MPI reports. Compared with the reference standard, the NLP algorithm accurately classified the MPI results. NLP also achieved high accuracy in extracting other clinical variables from the MPI reports, such as ischemic size, severity, artifact, and EF values. To the best of our knowledge, this is the first study to use a computer-based method for abstracting MPI reports. This approach does not depend on any particular clinical features from our institution. Therefore, it should also be applicable to other healthcare institutions.

The NLP-abstracted summary results from the MPI reports showed that MPI had good differentiating power in identifying patients at short-term cardiac risk. The 30-day cardiac event rates increased significantly with worsening MPI abnormalities. For instance, the patients with ischemia had a 9-fold higher 30-day death/AMI rate than patients with normal MPI. Compared with our previous studies, the 30-day death/AMI rates for the MPI, ETT, and overall ED populations were 4.1%, 0.3%,13 and 0.6%,25 respectively. The type of stress test ordered may reflect the clinician’s perception of a patient’s risk.

Patients with non-diagnostic studies had high 30-day death/AMI rates, even above those with a previous infarct. These non-diagnostic patients were likely heterogeneous since there were a variety of reasons leading to a non-diagnostic MPI. Our results may indicate a need for special attention to patients with non-diagnostic MPI results, who may be at higher than expected risk for adverse events.

Compared with previous studies on ED patients who underwent MPI, the patients in this study were older (mean age 69 vs 52-59), had more cardiovascular-related comorbidities, and had a much higher rate of abnormal MPI findings (30% vs 8-20%) (Supplemental Table S5).17,18,19,28,29,30 Conversely, the rate of abnormal findings in our study was at the low end (30% vs 29-49%) compared with studies in non-ED settings.31,32,33 The differences in patient characteristics between our study and other studies might be related to the integrated model in our institution. The findings in our institution might argue against the national trend of using more noninvasive imaging. For instance, while the US observed a 5-fold increase in noninvasive imaging testing from 1998 to 2008, the rate of ACS diagnosis has dropped by half.34 The decrease in abnormal findings may be attributed to testing younger and healthier patients.

Nevertheless, MPI is still an important diagnostic tool for downstream care. The clarity and completeness of MPI reports are crucial for risk assessment by the referring providers. However, approximately half of the reports do not adhere to recommended reporting standards, and referring providers frequently misestimate the extent of the ischemia.14 Levy et al reviewed a set of sample MPI reports from 44 sites in the Veterans Affairs system.14 They found that less than 5% of the reports had an explicit assessment of ischemic risk, although nearly all of the reports contained the data elements needed to assess it. We found similar and additional challenges in implementing the NLP method. Even within the same institution, there were substantial differences in the format and quality of the MPI reports. We listed three sample reports from this study in Supplemental Data S1, S2, and S3. As demonstrated in the sample reports, MPI reports frequently had ambiguous and hedging words that made accurate interpretation difficult (Supplemental Data S1). Although the majority of these reports described the location of the ischemia, they often did not use the standard terms (Supplemental Data S2). For reports with abnormal findings, the ischemic size and severity were not always clearly stated. Despite these challenges, we found that NLP could provide a coherent summary interpretation by synthesizing the data elements presented in the reports. As an automated method, NLP offers lower human review costs, higher efficiency, and greater consistency.

The MPI reports included in this study were based on conventional free-text reporting. This type of report was generated by dictation or typing with full flexibility. Over the past decades, a number of professional societies have promoted standardized and structured reporting of MPI studies.22,35 Structured reports will increase uniformity, reduce variability, and improve readability compared with conventional reports. Since structured reports are still written in natural language, NLP is still necessary to process large numbers of such reports, although it is less challenging to do so. In addition, structured reporting is unlikely to resolve all problems in conventional reporting. First, there are variations in structured reporting, such as templates, required components, and degrees of standardization.36 Second, despite the promotion of structured reporting, some physicians still favor free-text reporting.37 Finally, despite improved compliance, the proportion of non-compliant reports still stands at 43% in nuclear cardiology laboratories that applied for accreditation.38 Therefore, in studies performed across multiple institutions, the NLP algorithm must adapt to these heterogeneous types of reports.

Our study has some unique strengths. We validated our algorithm on a large and diverse population within an integrated care system with a comprehensive EHR. Moreover, our prepaid health plan reduced race-specific differences in seeking medical care. Furthermore, few studies have focused on the prognostic value of MPI for short-term cardiac events in a population referred from the ED with suspected ACS. Our study was able to assess the short-term cardiac outcomes due to the large size of our study population, despite the low event rates.

Study Limitations

Our study has some limitations. MPI results were based on the reading physicians’ interpretations, rather than adjudicated by a core lab. Variations in the accuracy of the test interpretation are expected among physicians. We did not have the resources to validate the written MPI reports by re-examining the MPI images. We limited our analyses to the ischemia/infarction-related findings since they are often the only information used in clinical decision making by the referring providers. The other variables extracted by NLP could augment the MPI results for better outcome prediction. Nevertheless, the NLP-extracted variables were not comprehensive. We did not include variables that the MPI reports did not consistently document. Moreover, we limited our analyses to short-term outcomes since they were the main clinical interest in managing the ED population. Finally, the language and style of reporting can differ across institutions. Our NLP algorithm might perform differently on other testing datasets.

Conclusion

Conventional MPI reports documented by dictation or typing are highly variable depending on physician preferences and practices, which complicates the interpretation of results by referring physicians, by researchers, and by automated abstraction. We developed and validated an automated NLP algorithm to abstract conventional MPI reports with high accuracy. This computational tool could support population-based studies of MPI results, which would otherwise be infeasible to capture due to the resources needed for manual chart review of thousands of results. Structured reporting could further assist these efforts.

New Knowledge Gained

Natural language processing provides an efficient way to categorize MPI reports as well as identify and extract other variables from a large number of conventional free-text MPI reports found in electronic health records. Automated abstraction of MPI reports by NLP will facilitate future research to inform how best to manage patients with suspected ACS and to make informed clinical recommendations about which patients may benefit most from MPI.