Introduction

Artificial intelligence (AI) methods to aid diagnosticians in the clinical interpretation of SPECT myocardial perfusion studies have been reported. Examples include neural networks,1,2,3,4 case-based reasoning,5 support vector machines,6 machine-learning,7 and knowledge-based expert systems.8,9 In expert systems, a knowledge base of heuristic rules is obtained from human experts, capturing how they make their interpretations. Yet, to date, no one has developed automatically generated and/or validated natural language structured reports (sRs) that follow society guidelines. The convergence of the high prevalence of heart disease, the increased complexity of cardiac imaging techniques, the increasing amount of patient-specific clinical information, and the reduced time the diagnostician can dedicate to each patient inevitably leads to misdiagnosis and potential patient mismanagement. Hence, AI tools could assist physicians in interpreting and reporting studies at a faster rate and at the highest level of up-to-date expertise.

Here we report on the development and validation, in 1,000 patients, of an expert system that applies its knowledge to patients’ left ventricle (LV) perfusion and function information extracted from myocardial perfusion imaging (MPI) studies to generate an AI-driven structured report (AIsR)9 following society guidelines.10 Although physicians can easily modify any aspect of the AIsR, here we only evaluate the automatically generated results.

Methods

Study Design

This is a single-center retrospective study designed to compare the diagnostic agreement between an automatically generated AIsR and the clinical rest/stress MPI report dictated by human experts. Each clinical report was dictated by one of nine nuclear cardiology (NC) experts. The primary hypothesis was that the per-patient and per-vessel diagnostic performance of the AIsR in reporting hypoperfusion [coronary artery disease (CAD)] and reversibility (ischemia) is comparable (i.e., not inferior) to that of human experts’ clinical reports. In a 100-patient cohort, agreement between the AIsR and the clinical report was compared to the agreement between the clinical report and a second interpretation of the same MPI studies by an independent 10th human expert (VM), who started at Emory after the last MPI study in the trial was acquired (2010) and thus was never privy to the clinical reports. The second goal was to apply the same methodology to the entire 1,000-study group to determine agreement rates between AIsR and experts.

Study Population

One thousand consecutive MPI conventional studies used for this evaluation were obtained from our cardiac database of patients (589 men) referred to Emory University Hospital for clinically indicated attenuation-corrected (AC) rest/stress myocardial perfusion SPECT imaging between May 2008 and March 2010. Note that none of these 1,000 patients was used for the development of the method. Patients imaged with a CZT SPECT camera and/or lower doses during this period were excluded due to differences in technology and changing protocols. Emory’s Institutional Review Board approved this research.

Clinical Data

Age, gender, body mass index, and risk factors data were extracted from the patients’ medical records in Emory’s data warehouse (Table 1). Risk factors mined were hypertension, hyperlipidemia, diabetes mellitus, smoking history, prior myocardial infarction, and prior revascularization. Representative quantitative MPI parameters were also extracted (Table 1) to characterize the population.

Table 1 Characteristics of the study population.

Standard Dual-Detector SPECT

All patients underwent eight-frame ECG-gated 1-day AC low-dose rest, high-dose stress Tc-99m tetrofosmin myocardial perfusion dual-detector SPECT according to the ASNC guidelines.11 Rest-stress doses were determined based on patient’s body weight starting at < 200 lbs [370 MBq rest (10 mCi), 1,110 MBq stress (30 mCi)]. Acquisition times were 14 minutes for rest imaging and 12 minutes for stress imaging. Conventional SPECT projections were obtained utilizing the simultaneous emission/transmission acquisition method that uses a scanning gadolinium-153 line source as the transmission source. The emission transaxial images were reconstructed with an OSEM algorithm with 4 subsets and 10 iterations and a uniform initial estimate. The scatter distribution obtained from the scatter window was used to correct both the scatter from the patient onto the photopeak window and the scatter from the patient onto the transmission energy window. Attenuation maps were reconstructed by means of a Bayesian algorithm with Butterworth filter preprocessing at 0.43 critical frequency and an order of 5.0. The attenuation map reconstruction used 30 iterations with a uniform initial estimate.

MPI Reporting as Reference Standard

In each patient, the detection of hypoperfusion at stress and the presence of reversibility at rest for each major vascular territory reported by AIsR were compared to those from clinical reports generated by one of nine NC experts, each with at least 5 years of experience. The clinical interpretations reported were used as the reference standard. The image interpretations for the clinical reports were performed in the routine conventional way. The diagnosticians had full use of Emory Cardiac Toolbox (ECTb) V3.0 images and quantitative results12 as well as all the usual clinical information requested by the interpreter. None of the nine interpreters had access to the AIsR results from ECTb V4, which was developed after 2010, nor did any of them participate in developing any of the heuristic rules in the program’s knowledge base.

Thus, because of the differences in the approaches, the global and regional values of the summed stress score (SSS) and the summed difference score (SDS) could differ considerably between V3 and V4. Disease was assigned to one or more vascular territories: left anterior descending artery (LAD), left circumflex artery (LCX), and the right coronary artery (RCA).

Interobserver Variability Subgroup

A subgroup of the last 100 consecutive patients was extracted from the 1,000 patients to determine the interobserver variability between experts. A tenth NC expert (VM), recruited to our institution after the last patient in the study was acquired, served as an independent reader to determine how the diagnostic variability between human experts’ reports compared to the variability between experts and the AIsR.

Image Analysis and AIsR Interpretation and Reporting

All MPI studies were reconstructed and reoriented into oblique-axis tomograms using conventional techniques according to ASNC guidelines.11 The studies were then submitted by a technologist to a well-established automatic method of extracting 3D rest and stress distributions of myocardial perfusion and function.12 The technologist reviewed the processing and manually modified the automatically determined parameters if deemed incorrect, which was done in fewer than 10% of studies and usually at the LV base.

These 3D distributions were then submitted to our iterative method of database quantification implemented in ECTb V4.0. This iterative approach determines the 0 to 4 score for each of the conventional 17 segments using three iterations through the rest and stress AC, and non-AC perfusion, and non-AC function distributions. The iterative steps were as follows: (1) determining the certainty that a segment is abnormal, (2) assigning the score to each of the 17 segments, and (3) using our expert system to modify the score consistent with all the information available for that segment which we call a smart score.

Step 1: determining certainty of segment abnormality

A certainty factor (CF) ranging from − 1 to + 1 is determined for each of the 17 LV segments (− 1 = definitely no count reduction (normal), + 1 = definitely count reduction, and the range from − 0.2 to + 0.2 indicates a finding that is equivocal or indeterminate). This CF determination of segment abnormality first calculates the probability (Ps, in %) for each segment13 that a patient’s normalized perfusion distribution (relative blood flow) is lower than that of the normal distribution redeveloped from a previously reported group of normal low likelihood (LLK) patients.14,15 Since the relative blood flow is extracted in terms of number of counts and these counts vary depending on the injected dose, patient size, LV size, and instrument sensitivity (SN), the counts \( c_{\text{vs}} \) for each voxel v in segment s have to be normalized both by the maximal voxel count uptake (Cmax) over the entire LV and by the total number of LV voxels in each segment (Vs). The normalized count density (nvs) for each voxel in segment s is given by

$$ n_{\text{vs}} = \left[ {100c_{\text{vs}} } \right]/\left[ {V_{\text{s}} C_{\hbox{max} } } \right]. $$

The value of a cumulative distribution function over all voxels in segment s is given by \( \Upgamma_{\text{ns}}^{\text{pt}} \) as the sum of all normalized count densities for patient pt:

$$ \Upgamma_{\text{ns}}^{\text{pt}} = \sum\limits_{{\text{v}}}{\text{n}}_{\text{vs}} :\;\Upgamma_{{\text{ns}}}^{\text{pt}} = 0\quad {\text{for all}}\;{\text{n}}_{\text{vs}} > {\text{n}}_{\text{vs}} \left( \Upgamma_{\text{ns}}^{\text{pt}} = 100 \right). $$

Thus, for example, the value of \( \Upgamma_{\text{ns}}^{\text{pt}} \) at 50% in segment 2 in Figure 1 is found by locating 50 on the x-axis, moving up to the patient’s red distribution, and reading the corresponding value on the y-axis, 55%. This value of \( \Upgamma_{\text{ns}}^{\text{pt}} \) represents the percentage of the total number of voxels in segment 2 with an nvs ≤ 50%. In Figure 1, the red distributions are the normalized cumulative count value stress distributions for each of the 17 segments of the patient shown in the polar map. Note that the patient’s distribution (red) is set to zero after it reaches 100%. This was done to increase the \( [\Upgamma_{\text{ns}}^{\text{pt}} - \Upgamma_{\text{ns}}^{\text{nl}}] \) difference and thus the discriminatory power of Ps.
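The cumulative distribution described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ECTb implementation: the function name, the input format, and the toy counts are hypothetical, each voxel is expressed as a percentage of Cmax with the division by Vs applied to the voxel frequency, and "set to zero after it reaches 100%" is read as zeroing every level above the one at which the curve first reaches 100.

```python
def segment_cumulative_distribution(counts, c_max, levels=range(0, 101)):
    """Return Gamma(level): percent of segment voxels with normalized count <= level.

    counts : raw counts of the voxels in one segment (hypothetical input)
    c_max  : maximal voxel count over the entire LV
    """
    normalized = [100.0 * c / c_max for c in counts]  # percent of LV maximum
    n_voxels = len(normalized)
    gamma = []
    reached_100 = False
    for level in levels:
        if reached_100:
            # Assumed reading of "set to zero after it reaches 100%".
            gamma.append(0.0)
            continue
        value = 100.0 * sum(1 for n in normalized if n <= level) / n_voxels
        gamma.append(value)
        reached_100 = value >= 100.0
    return gamma


# Toy segment with four voxels; the counts are illustrative, not patient data.
gamma = segment_cumulative_distribution([20, 40, 50, 80], c_max=100)
print(gamma[50])             # 75.0: three of the four voxels are at or below 50%
print(gamma[80], gamma[81])  # 100.0 at the segment maximum, then 0.0
```

Reading `gamma[50]` corresponds to the graphical lookup described for segment 2 in Figure 1.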

Figure 1

17-segment results from a patient with LCX vessel disease. Color polar map inset (A) shows the myocardial perfusion distribution for a female patient with LCX vessel disease with the 17-segment model with scores superimposed. The 17 plots correspond to the 17-segment model (B) with the LAD segments on the top, LCX in the middle, and RCA in the bottom rows. The x-axes are the normalized count values, and the y-axes are the normalized voxel frequencies with those count values. The white distributions are the averaged normalized cumulative distributions from 20 female patients with low likelihood of CAD. The red distributions are the normalized cumulative count value distributions for the patient shown in the polar map. Note that red distributions to the left of the white normal ones represent increasing certainty of abnormality. Also note how well behaved is the shape of each of the patient’s segmental distributions even though it represents a small portion of the LV from just one patient

The white distributions \( \Upgamma_{\text{ns}}^{\text{nl}} \) are the cumulative distribution functions from all normal patients used to create this specific nonparametric normal database. The probability Ps that a patient’s tracer distribution is lower than that of the normal distribution is then determined for each of the 17 LV segments as

$$ P_{s} = 100\sum\limits_{\text{n}}\left[ \Upgamma^{\text{pt}}_{\text{ns}} - \Upgamma^{\text{nl}}_{\text{ns}} \right]/\Upgamma^{\text{pt}}_{\text{ns}} . $$

Note that Ps is a function of nvs. Also, note that to determine the probability Ps, we are summing over all available n’s (i.e., all available samples of normalized count values) that is equivalent to summing all n’s from 0% to 100%. These Ps are converted to CFs by a transformation from [0, 100] → [− 1, 1] using Shannon’s information theory.16 In this information approach, CF is obtained by using a transformation function between percent (Ps) of a segment being abnormal and uncertainty U = (1 − CF) as

$$ U = -\sum\limits_iP_{si} \log_{2} P_{si} , $$

where i indexes the possible states: in this case 2, normal and abnormal. For example, in Figure 1, for segment 6, P6 = .89 (or 89%), hence U = − (.89 log2 .89 + .11 log2 .11) = .50, and therefore CF = 1 − .5 = .5, which falls in the abnormal range (> .2) and is consistent with this hypoperfused (abnormal) segment. For segment 8, on the other hand, the patient’s distribution (red) is inside the normal distribution (white), and thus, the CF obtained is negative, which indicates that the segment is normally perfused. This allows CFs to range from − 1 to + 1. CFs are calculated for each segment and for each quantitative parameter used as input to the AIsR. This is a nonparametric approach as no assumptions are made as to the properties of the normalized count distribution (usually incorrectly approximated as Gaussian).
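The segment 6 arithmetic can be checked with a short Python sketch. The function names are hypothetical, and the sign convention (positive CF when the abnormal probability is at least .5, negative otherwise) is an assumption made for illustration; the text states only that segments lying inside the normal distribution receive a negative CF.

```python
import math


def binary_entropy(p):
    """Shannon uncertainty U (in bits) for two states with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def certainty_factor(p_abnormal):
    """Map an abnormal probability in [0, 1] to a CF in [-1, 1] using U = 1 - |CF|.

    Assumption: the sign is taken from which state is more probable.
    """
    magnitude = 1.0 - binary_entropy(p_abnormal)
    return magnitude if p_abnormal >= 0.5 else -magnitude


print(round(binary_entropy(0.89), 2))    # 0.5, the U quoted for segment 6
print(round(certainty_factor(0.89), 2))  # 0.5, the CF quoted for segment 6
```

Under this convention a probability of .5 gives maximal uncertainty (U = 1) and CF = 0, the center of the equivocal range.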

Step 2: assigning a score to each of the segments

This step converts the CF value for each segment into a score (0 to 4, Figure 2). All segments with a normal CF (< − .2) are given a score of 0. The score for each abnormal (CF > .2) or equivocal (− .2 < CF < .2) segment depends on two parameters: (1) the type of distribution (stress, rest perfusion; perfusion reversibility; AC vs non-AC, supine vs prone, stress, rest thickening, thickening reversibility) and (2) the magnitude of the parameter (% uptake for perfusion, % thickening for thickening). These CF settings were done at three different levels (modes) of SN/specificity (SP) settings: (1) high SP, where an equivocal CF in the AIsR was set to normal; (2) high SN, where an equivocal CF in the AIsR was set to abnormal; and (3) tradeoff SN/SP, where the lower half of the equivocal CF range (− .2 to 0) was set to normal and the upper half (0 to .2) to abnormal.
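How the three modes resolve an equivocal CF can be sketched as follows. The function and mode names are hypothetical, and two boundary choices are assumptions because the text does not specify them: a CF of exactly ± .2 is treated as equivocal, and a CF of exactly 0 is treated as normal in the tradeoff mode.

```python
def classify_segment(cf, mode):
    """Classify a segment as 'normal' or 'abnormal' from its certainty factor.

    mode: 'high_sp', 'high_sn', or 'tradeoff' (hypothetical labels for the
    three SN/SP settings described in the text).
    """
    if cf > 0.2:
        return "abnormal"
    if cf < -0.2:
        return "normal"
    # Equivocal range: the outcome depends on the selected mode.
    if mode == "high_sp":
        return "normal"
    if mode == "high_sn":
        return "abnormal"
    if mode == "tradeoff":
        return "abnormal" if cf > 0.0 else "normal"
    raise ValueError(f"unknown mode: {mode}")


for mode in ("high_sp", "tradeoff", "high_sn"):
    print(mode, classify_segment(0.1, mode), classify_segment(-0.1, mode))
```

Only equivocal segments change classification between modes; clearly normal or clearly abnormal segments are unaffected.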

Figure 2

Combined slices/polar map displaying the patient with reversible lateral wall perfusion defect from Figure 1. Stress (top)/rest (bottom) SPECT attenuation-corrected slices, rotating projections, transmission slices, and 17-segment smart-scores. Note three contiguous segments in the lateral wall of the stress polar maps each with a score of 2 (SSS = 6) corresponding to 9% of the LV hypoperfused. Also note that circles around the stress perfusion scores (inset A) signify that the original scores in Figure 1A were modified by the expert system

A set of scores is determined for each segment in each distribution, and these sets are then merged into one set of results for stress perfusion, rest perfusion, reversibility perfusion, stress thickening, and rest thickening. The merger takes place such that the most normal score for each segment in each distribution is retained. For example, if the score for segment 16 in the stress perfusion distribution is a 2 for non-AC and a 0 for AC (or prone), the combined score retained is a 0.
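Because lower scores are more normal on the 0 to 4 scale, retaining the most normal score amounts to a per-segment minimum. A minimal sketch follows; the function name and the score lists are hypothetical and are not taken from the study data.

```python
def merge_scores(*score_sets):
    """Merge several 17-segment score lists by keeping the lowest (most normal)
    score for each segment."""
    if any(len(scores) != 17 for scores in score_sets):
        raise ValueError("each score set must contain 17 segment scores")
    return [min(segment_scores) for segment_scores in zip(*score_sets)]


# Illustrative stress perfusion scores; index 15 is segment 16.
non_ac = [0] * 17
ac = [0] * 17
non_ac[15] = 2   # abnormal without attenuation correction
ac[15] = 0       # normal with attenuation correction
non_ac[10] = 2   # abnormal in both distributions
ac[10] = 3

merged = merge_scores(non_ac, ac)
print(merged[15])   # 0: the AC score is retained for segment 16
print(merged[10])   # 2: the lower of the two abnormal scores
print(sum(merged))  # 2: summed score of the merged set
```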

Step 3: determining smart-scores and AIsR generation

Here all sets of scores from step 2 are used as input to our expert system. This is a Bayesian inference engine that forward chains through our MPI knowledge base of interpretation and reporting heuristic rules, similar to our previous reports8,9 and following well-established expert system methods.17 This expert system uses these input scores to determine the certainty of the location, size, shape, and reversibility of both the perfusion defects and thickening abnormalities to infer the certainty of the presence and vascular location of CAD. This information is then transmitted to the AIsR in natural language text. One main difference between our current expert system and our previous one9 is that now all information for each segment is weighted to modify each segmental score during this iteration and the AIsR follows ASNC guidelines for reporting.18 Thus, for example, a segment that exhibits a fixed perfusion defect in the non-AC distributions is more certain to be fixed if it is also fixed in the AC distributions and even more certain if the segment is thickening abnormally. Once all perfusion and function smart-scores (Figure 2A inset) and pertinent prespecified data elements [e.g., LVEF, transient ischemic dilation (TID)] along with their CF values are determined, they are exported as a highly structured object which is then imported by the AIsR. These exported data elements are mapped onto the existing data entry fields within the AIsR. When the user begins generating the report, all of the mapped input entry fields are automatically prepopulated, including the smart-scores data generated by our expert system.

All the natural language text is conditionally generated by the reporting module of the system. In brief, take, for example, the results in Figure 3 and the AIsR report in Figure 4A. Specifically, consider the conclusion in both figures: “the apical lateral segment is completely reversible.” Before reaching the report, the nonparametric statistics combined with the expert system portion of the AIsR have determined CFs for each possible state (category). In this case of apical lateral segmental reversibility, the system has determined a CF that the segment is completely reversible, another CF that it is partially reversible, another CF that it is minimally reversible, and another CF that it is fixed. The natural language generator reads these states and chooses the one with the highest CF as the condition to report, in this case completely reversible.
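The selection step can be illustrated with a small sketch. The function name, the sentence template, and the CF values are all hypothetical; only the rule of reporting the state with the highest CF comes from the description above.

```python
def report_reversibility(segment_name, state_cfs):
    """Return a sentence reporting the reversibility state with the highest CF.

    state_cfs maps each candidate state to its certainty factor in [-1, 1].
    """
    best_state = max(state_cfs, key=state_cfs.get)
    return f"The {segment_name} segment is {best_state}."


# Illustrative CFs for one segment (not taken from the study).
apical_lateral = {
    "completely reversible": 0.7,
    "partially reversible": 0.2,
    "minimally reversible": -0.3,
    "fixed": -0.8,
}
print(report_reversibility("apical lateral", apical_lateral))
# The apical lateral segment is completely reversible.
```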

Figure 3

Automatically generated AIsR perfusion subreport of patient from Figure 2. Note concordance with the oblique slices and smart-scores. All drop-down arrows indicate a parameter that can be modified by the nuclear cardiology expert before it reaches the final report (not used for this validation)

Figure 4

Findings and impressions extracted from AI-structured report (A) and actual excerpts of the clinical report (B) for the MPI study shown in Figures 1 to 3. Note concordance in the presence and the location of hypoperfusion associated with ischemia

Statistical Analysis

All studies were classified as normal (definitely normal or probably normal) or abnormal (definitely abnormal or probably abnormal) based on whether the report described one or more stress perfusion defects. To test the primary hypothesis, the methodology we previously reported to test for noninferiority was used.14 The difference between two population proportions from a single sample19 was used to test whether the AIsR-expert reporting agreement differed from the independent-expert agreement. If AIsR findings are equivalent to expert findings, the expected difference between the AIsR-expert agreement and the independent-expert agreement is zero. The primary analysis tested the null hypothesis of equivalence of AIsR-expert agreement to independent-expert agreement (no agreement rate reduction) vs inferiority (a reduction of > 0%). A 95% confidence interval (CI) for the difference between the AIsR-expert agreement rate and the independent-expert agreement rate was calculated and the null hypothesis rejected if the upper limit was below 0% with a corresponding one-tail P value less than .05. Interobserver agreement between AIsR findings and expert findings for all 1,000 MPI studies was measured using percent agreement (accuracy) and Cohen’s κ value. McNemar’s test was used to test the statistical differences in accuracy in the 1,000 MPI studies between each of the three SN/SP modes. To test whether there were differences between the MPI studies from the 1,000 patients and the 100-patient cohort as to the prevalence of CAD, ischemia, and AIsR agreement rate, the MedCalc χ2 comparison of proportions was used. A P < .05 was considered significant for all comparisons.
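For readers who want to reproduce this kind of agreement analysis, percent agreement, Cohen’s κ, and McNemar’s test can each be computed from a 2 × 2 table. The sketch below uses only the Python standard library; the function names and the example counts are hypothetical, and the McNemar statistic is the uncorrected χ2 form, which may differ from the exact variant used in the study.

```python
import math


def agreement_stats(a, b, c, d):
    """Percent agreement and Cohen's kappa for a 2 x 2 table.

    a: both abnormal, b: AIsR abnormal / expert normal,
    c: AIsR normal / expert abnormal, d: both normal.
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from the marginal totals of the two raters.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


def mcnemar_p(b, c):
    """Two-sided P value of the uncorrected McNemar chi-squared test (1 df),
    where b and c are the two discordant cell counts."""
    if b + c == 0:
        return 1.0
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# Illustrative counts only (not the study data).
accuracy, kappa = agreement_stats(a=20, b=5, c=10, d=65)
print(round(accuracy, 2), round(kappa, 3))
print(round(mcnemar_p(5, 10), 2))
```

When McNemar’s test is used to compare the accuracy of two modes, b and c are the numbers of studies classified correctly by one mode but not the other.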

Results

Interobserver Analysis

The human experts’ reporting of the 100-patient subgroup resulted in 17 patients with CAD and 83 without. Of the 17 patients diagnosed with CAD, 9 were reported to be ischemic. The breakdown of stress hypoperfusion by vascular territory in the 17 CAD patients was as follows: 8 LAD, 10 LCX, and 5 RCA. The breakdown of reversible ischemia by vascular territory in the 9 ischemic patients was: 6 LAD, 5 LCX, and 1 RCA. The overall agreement rates, P values, agreement differences, and 95% CI for each of the validated reported categories are shown in Table 2. At the high SP level, agreement between the AIsR findings/impressions and the experts’ findings/impressions did not differ statistically from agreement between the independent (10th) reader’s findings/impressions and the experts’ in reporting the same studies. This held for the reporting of both CAD (P = .33) and ischemia (P = .37). There were statistical differences for the tradeoff SN/SP level (CAD P = .01, ischemia P = .03) and even larger differences for the high SN level (CAD P < .001, ischemia P < .001). At the high-SP level the 95% CI is above 0% for all categories (i.e., the AIsR findings are not inferior to the human expert reports) whereas they are below zero at four of eight categories at the tradeoff level and all eight categories for the high-SN levels.

Table 2 Agreement between automated smart-report results and human experts at three different sensitivity/specificity modes (n = 100).

AIsR Agreement with Experts

The nine human experts’ reporting of the 1,000-patient population resulted in 247 patients with CAD and 753 without. Of the 247 patients diagnosed with CAD, 120 were deemed ischemic. The breakdown of stress hypoperfusion by vascular territory in the 247 CAD patients revealed 135 LAD, 103 LCX, and 85 RCA. These included 194 patients with single-vessel disease, 169 with double-vessel disease, and 117 with triple-vessel disease. The breakdown of reversible ischemia by vascular territory in the 120 ischemic patients revealed 61 LAD, 63 LCX, and 28 RCA. There were no significant differences between the 100-patient cohort used to test the noninferiority of AIsR vs expert and the 1,000-patient study group used to determine agreement rates between AIsR and experts. The categories tested were prevalence of CAD (247/1,000 vs 17/100; P = .11), prevalence of ischemia (120/1,000 vs 9/100; P = .37), agreement rate for CAD (820/1,000 vs 85/100; P = .45), and agreement rate for ischemia (880/1,000 vs 89/100; P = .77). All statistical comparisons were done using AIsR’s high-SP mode.
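The ischemia and agreement-rate comparisons above can be approximately reproduced with a χ2 comparison of two proportions. The sketch below is a check under assumptions, not the MedCalc implementation: the function name is hypothetical, the "N − 1" form of the χ2 statistic is assumed, and the two groups are treated as independent samples.

```python
import math


def compare_proportions(x1, n1, x2, n2):
    """Two-sided P value for the difference between proportions x1/n1 and x2/n2,
    using a pooled chi-squared statistic with the (N - 1)/N adjustment (1 df)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    variance = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    chi2 = (p1 - p2) ** 2 / variance
    chi2 *= (n1 + n2 - 1) / (n1 + n2)  # assumed "N - 1" adjustment
    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


print(round(compare_proportions(120, 1000, 9, 100), 2))   # prevalence of ischemia
print(round(compare_proportions(820, 1000, 85, 100), 2))  # agreement rate for CAD
print(round(compare_proportions(880, 1000, 89, 100), 2))  # agreement rate for ischemia
```

Under these assumptions the three P values come out close to the reported .37, .45, and .77.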

Figure 2 depicts images and smart-scores in a female patient with reversible defects in the LCX coronary territories with the corresponding smart-reports shown in Figures 3 and 4A. Figure 4B shows the findings and impressions of the actual clinical report.

Figure 5 shows agreement results of AIsR-experts for the entire 1,000 patient group using the reported expert clinical read as the reference and compared for the three levels of SN/SP. These agreements are shown with regard to detection of stress-induced hypoperfusion and stress-induced ischemia. Note that for both the CAD and ischemia categories, the high SP level yielded the highest accuracy and SP across global and regional results. The differences in accuracy between modes were statistically significant across all comparisons for global and regional hypoperfusion and reversibility. Table 3 shows percent agreement and κ values between the AIsR and the experts’ impressions of CAD and ischemia in the 1,000 MPI studies. These κ values ranged from 32.3 to 51.9, corresponding to a range from fair to moderate agreement, as might be expected given the variation in clinical reports amongst nine different experts.

Figure 5

Diagnostic performance of the AI-structured report in reporting stress-induced hypoperfusion as indicative of CAD (top row) and reversibility at rest as indicative of ischemia (bottom row). Results for the three modes, high specificity (green bars), sensitivity (SN)-specificity (SP) tradeoff (red bars), and high sensitivity (blue bars), are shown for agreement (i.e., accuracy: left column), specificity (middle column), and sensitivity (right column) (*P < .001). The labels CAD and ischemia in the abscissa of each graph refer to global findings regardless of vascular territory

Table 3 Agreement, κ, and 95% CI results for the automated AIsR using high-specificity mode and the human experts reports as reference standard (n = 1000)

Discussion

We developed and validated the diagnostic performance of an MPI natural language reporting system that utilizes nonparametric relative perfusion and function quantification as input to our expert system to interpret the study and generate the report. This is the first study that compares automatically generated MPI natural language reports to actual clinical reports.

Our results show that the reporting of CAD (hypoperfusion at stress) and ischemia (reversibility at rest) from our automatically generated AIsR is not statistically inferior to that of experts when a high-SP mode is used (i.e., equivocal = normal) and the reporting of other experts is used as the reference standard. Importantly, this high-SP mode yielded the highest accuracy in our extensive population. It should not be surprising that AIsR best agreed with the experts in the high-SP mode, since this reflects human image interpretation adjusting to the drop in the prevalence of abnormal studies to 25% at our institution (also in this population), similar to trends reported by others20 and reported as low as 9% at other major institutions.21 These findings are also consistent with those reported from a meta-analysis of 49,000 patients demonstrating diagnostic performance for referral bias corrected MPI (similar to echocardiography) of 99% SP and 38% SN (from 69%, 85% uncorrected, respectively).22

Strength of the Approach

This is the first report showing full integration between an image analysis system and structured reporting, serving a critical need in modern imaging practice. Although the best agreement existed when the high-SP mode was selected, this choice is easily modified to a high-SN level (or tradeoff level) when the AIsR is used to report on patients from a high-risk population such as patients with diabetes. Newly reported here is the determination and use of our 17-segment smart-scores. This novel scoring uses a nonparametric normalized count distribution applied to information theory to generate a certainty of abnormality. This certainty for each segment is modified according to all the available perfusion and function information for that segment including rest, stress, changes between stress and rest, AC and non-AC images, and prone images. Although not validated here, the diagnostician is allowed to change manually any of the scores, which in turn would modify the report if needed. Importantly, as previously reported,23 the expert system tracks all steps in generating the report as a justification, which may be used by the diagnosticians to decide whether or not they agree with the findings or impressions in the report. This is an important benefit of expert systems over conventional neural net or machine-learning approaches. Another benefit of the expert system approach used here is that, compared to other AI approaches, only the 40 normal patients used for database generation were needed to train the system as most of the training comes from the cumulative experience of the experts.

Comparison of AIsR to PERFEX

As described in the “Methods” section, we had previously developed and validated a decision support expert system to assist NC physicians with the image interpretation process.8,9 There are several differences between that system (PERFEX) and the one reported here. PERFEX divided the LV into 32 segments; AIsR uses the standard 17-segment system. PERFEX depended on Gaussian distributions and statistics to determine normality and abnormality criteria; AIsR uses nonparametric statistics. PERFEX did not use the global or regional functional information to reach its conclusions; AIsR integrates the functional information into all its conclusions. PERFEX did not use its conclusions to modify the ECTb results; AIsR uses its knowledge base and the available quantitative information to modify the original segmental scores into smart-scores. If AC was performed, PERFEX would provide a separate interpretation for the AC study and one for the non-AC study; AIsR integrates both into one set of scores and one conclusion. If a prone study was also performed, AIsR would integrate it as well. This integration takes place by trying to mimic in the code how human experts use the information. Before the integration is done, AIsR determines segmental scores separately for each of the diagnostic categories considered: stress perfusion, rest perfusion, reversibility, and thickening. After these individual scores are determined, AIsR integrates the information in a meta-analysis module. Therefore, if an MPI study had AC, non-AC, and prone studies performed, AIsR would use the most normal score for that segment. If the same segment exhibited reversibility, AIsR would then modify the score using Bayesian statistics and the strength of the information (i.e., how much reversibility was present). Similarly, if the same segment exhibited abnormal thickening, then AIsR would again modify the score using the same approach as the one with reversibility.
Perhaps the most obvious difference between PERFEX and AIsR is that AIsR propagates its conclusions into a sR.

Reference Standard

Since AI systems have to be “trained” and validated with both input images and accepted output interpretations, the question of what to use as the reference standard often arises. Use of invasive coronary angiography or clinical outcome as the gold standard for training and validating an MPI AI system is often mentioned as an attractive goal, but it misses the point of these systems, that is, to interpret studies with the same level of expertise as experts. Moreover, using invasive catheterization as a gold standard is biased by the referral pattern of abnormal MPI studies to catheterization as well as by the discrepancies in comparing physiologic results to anatomic ones. Outcome is certainly an important measure, but in MPI, coronary angiography and outcomes as gold standards are confounded by the fact that the scan interpretation (e.g., ischemia or no ischemia) has a major impact on the referral to the catheterization lab or the clinical outcome (intervention vs observation); consequently, these gold standards are biased. Simply stated, the interpretation of the study affects the treatment, and the treatment affects the outcomes, thus biasing the outcomes as a reference standard. Thus, the practice of using interpretation of the MPI studies by experts is an acceptable approach that has been used by other researchers and ourselves.9,24

Limitations

First, all the data used for this evaluation were obtained retrospectively from one center. Second, we had to extract manually the needed diagnostic information from the clinically dictated reports to use as the reference standard. Third, all the clinical reporting was performed by Emory experts. Although these experts were trained at different institutions, it could be argued that over time, they tended to read similarly and perhaps differently from readers at other institutions. Fourth, although the AIsR uses standardized reporting guidelines, we did not compare the size and severity of the hypoperfused or reversible areas between the experts and the AIsR, but only studied whether these were present and if so in which vascular territory. This is in part because, when the clinical reports were generated, reporting guidelines were not being strictly applied by the experts. Fifth, we also chose not to report here the clinical reporting agreements as to functional variables. Although these functional parameters were used in the generation of the smart-scores, these variables are quantitative and straightforward in how they are usually reported and were therefore not compared, for simplicity. Sixth, although we have previously integrated patients’ clinical information with their imaging results in order to improve diagnostic accuracy,25 this was not attempted here, as it would require manual input and/or EMR interfaces with hospital systems that would currently limit the applicability of this AIsR. Seventh, the agreement in reporting between the AIsR in the high-SP mode and our clinicians reflects the current reduced prevalence of disease (25%) of our patient referral pattern. In other scenarios (such as other countries) where the prevalence of disease is much higher than 25%, different results could have been obtained. This is the rationale that motivated us to allow the AIsR to switch easily between modes such as high SN and SN/SP tradeoff mode.
Finally, although the use of AC is not a limitation but an attribute that reduces the complexity of image interpretation, results of applying our approach to a large study population without AC (or prone imaging) cannot be predicted by the present study.

New Knowledge Gained

Nonparametric statistics can be used to determine certainty that a regional parameter of LV perfusion and/or function is abnormal. Due to apparent reduced prevalence of CAD in populations of patients undergoing MPI, automated diagnostic systems agreement with experts improves when set to analyze images at high-SP settings.

Conclusions

Automatic sRs from computer-assisted interpretation of rest/stress myocardial perfusion SPECT studies by an AI expert system when operating at a high-SP level statistically agree with the interpretations of NC experts and exhibit diagnostic accuracy consistent with that of experts when their clinical reports are used as the reference standard.