Radiology reports are an important part of assessing disease severity and response to therapy, and multiple radiology societies are focused on improving reporting practices [1–6]. One proposed method involves standardizing report templates, also called structured reporting (SR). Potential benefits of SR include more consistent reporting of important scan features [3], better utilization of scan interpretations for research purposes [7, 8], and catering to referring clinician preference [9, 10]. Conversely, concerns regarding diagnostic accuracy, completeness, and time required to complete templates have been raised [11–13]; consequently, SR remains a divisive issue in many radiology practices.

SR has been found to be useful when documenting complex disease processes, such as local staging of pancreatic cancer [3]. Inflammatory bowel disease (IBD), another complex disease process, may also benefit from a standardized report template [14]. In particular, SR may be helpful for computed tomographic enterography (CTE), the gold standard CT examination for diagnosing IBD and monitoring its disease activity [15]. Cross-sectional imaging assists clinicians’ assessment of response to therapy and aids in planning surgical intervention [16, 17], and SR may promote more consistent data exchange and ultimately facilitate communication between radiologists, gastroenterologists, and surgeons. In addition, SR has the potential to assist radiologists in remaining consistent across examinations, particularly when different radiologists are interpreting scans from the same patient over time.

Increased disease complexity has the potential not only to hinder communication across provider types and between radiologists, but it also proves challenging for radiology trainees. Although infrequent, errors made by radiology residents tend to occur when interpreting more complex examinations [18, 19]. It has been proposed that SR may help residents develop a search pattern and improve accuracy; however, results have been mixed [11, 20, 21].

As IBD is a complex process involving multiple bowel segments as well as findings outside the bowel, we hypothesized that using SR would result in a more complete assessment of the abdomen and pelvis and that SR might help trainees improve the consistency and accuracy of their reporting. Thus, the purpose of this investigation was to objectively compare the content and accuracy of structured reports (SR) versus non-structured reports (NSR) across multiple training levels for CTE examinations of patients with IBD.

Materials and methods

Our institutional review board approved this Health Insurance Portability and Accountability Act-compliant, retrospective study, and the requirement for informed consent was waived.

Patient population

Our institution is a tertiary referral center, and the picture archiving and communication system was searched for CTE examinations performed between July 1, 2013 and July 30, 2015. A random selection of patients (18 years of age or older) with findings of active inflammatory bowel disease included in the impression on the finalized report was included. Patients with negative CTE examinations were excluded. Demographic data were reviewed in the electronic medical record, and the final patient population included 30 subjects (15 male and 15 female), aged 18–77 years (mean = 41.9 ± 14.7 years). All 30 subjects had a history of inflammatory bowel disease; 25 had Crohn’s disease, 3 had ulcerative colitis, and 2 had unspecified disease (pathology suggesting IBD but without definite categorization).

Imaging protocol

All CT examinations were performed on 64-detector multi-detector CT scanners (Discovery HD750, GE Healthcare, Milwaukee, WI, USA or Somatom Definition Flash, Siemens Healthcare, Erlangen, Germany). Scan parameters were as follows: tube voltage 120 kVp; automated tube current modulation with a noise index of 22 (GE systems) or a reference mAs of 200 mAs (Siemens systems); and 500-ms gantry rotation with 0.625-mm collimation. Images were reconstructed as 5-mm contiguous axial images and 3-mm contiguous coronal reformatted images were created. Prior to scanning, patients ingested 450–900 mL of low-Hounsfield unit enteric contrast (VoLumen [0.1% barium suspension], Bracco Diagnostics, Princeton, NJ, USA) over a 60-min period, based on patient tolerance. Intravenous contrast enhancement was obtained with 150 mL iopamidol (300 mg iodine/mL; Isovue-300, Bracco Diagnostics) injected at 3 mL/s. Images were obtained during the enteric phase with a 45-s scan delay.

Report template

The structured reporting template for the CTE examination was derived from a template proposed by the Society of Abdominal Radiology Crohn’s disease focus group and by Baker et al. with minor modifications [16]. The final version contained 14 key features of IBD assessment (Fig. 1). No drop-down lists or default responses were included; blank spaces were present for each item. Of note, the presence of steatorrhea (low-density fecal material) was included, as its symptoms can mimic those of active IBD [22, 23].

Fig. 1

Structured report template for interpretation of CT enterography for patients with inflammatory bowel disease

Imaging analysis

Nine radiologists independently reviewed the datasets: three fellowship-trained abdominal radiologists (faculty) with 6–11 years’ post-fellowship experience, three abdominal imaging radiology fellows, and three senior radiology residents (R3–R4). Each reader was first provided the 30 CTEs and asked to interpret them using a typical free text report. They were not instructed on any specific features to report, nor were they provided any definitions for feature criteria (e.g., they were not given a definition of what constitutes ‘stricture’). After a wash-out period of at least 4 weeks, they were provided the same 30 examinations, this time to interpret using the provided structured report. In total, the 30 CTE examinations were interpreted by all nine readers, for a total of 270 NSR and 270 SR. All reports (NSR and SR) were constructed using the PowerScribe 360 dictation system (Nuance Communications, Inc., Burlington, MA, USA).

Quantitative analysis

Both the NSR and SR were then assessed in consensus by a fourth-year radiology resident and a fellowship-trained abdominal radiologist with 5 years’ post-fellowship experience for the documentation of the presence or absence of 15 key reporting features. In the NSR, mention of a key feature qualified as documentation, whether the finding was positive or negative (e.g., “no abscess” counted the same as “abscess present”). Similarly, in the SR, an entry in the corresponding field stating either the presence or the absence of a feature qualified as documentation.

Five of the key features were then analyzed for accuracy: multifocal disease, stricture, fistula, fluid collection, and perianal disease. The reference standard for each case was established in consensus by two unblinded abdominal radiologists with 5 and 12 years’ post-fellowship experience, respectively.

Reader experience

The participating radiologists (the readers) were administered a survey following the completion of the study. The survey featured four subjective questions regarding the radiologist’s experience with SR and his or her opinion of it moving forward. The questions were as follows: (1) Using the structured report was 1 = easy, 2 = difficult but got better with practice, or 3 = consistently bothersome and fatiguing. (2) Moving forward, how would structured reporting for CTE of IBD affect your productivity in the reading room: 1 = definite increase, 2 = slight increase, 3 = no change, 4 = slight decrease, or 5 = definite decrease. (3) Overall, what is your report type preference for reporting CTE of IBD: 1 = strongly prefer structured, 2 = slightly prefer structured, 3 = no preference, 4 = slightly prefer non-structured, or 5 = strongly prefer non-structured. (4) After participating in this study, how likely are you to use structured reporting for CT of other disease processes: 1 = definitely more likely, 2 = somewhat more likely, 3 = neutral, 4 = somewhat less likely, or 5 = definitely less likely.

Qualitative analysis

De-identified CTE reports from the three faculty readers were provided to three referring physicians (a pediatric gastroenterologist with 7 years of post-fellowship experience, a gastroenterology fellow, and a colorectal surgeon with 12 years of post-fellowship experience) who treat patients with IBD in their clinical practices. A total of 30 reports (15 NSR [5 per reader] and 15 SR [5 per reader]) were randomly selected for review by each clinician, who independently evaluated the reports based on three criteria: ease of information extraction (0 = difficult, 1 = needs some effort, or 2 = easy), clarity of patient anatomy (0 = unclear, 1 = mostly clear, or 2 = very clear), and ability to identify disease phenotype [17, 24] (1 = yes, 2 = no).

Statistical analysis

Results were summarized using descriptive statistics. For the quantitative analysis, a mixed-effects linear regression model (to account for the correlation between observations due to using the same cases/readers) was used to determine whether there was a difference in the number of features per report between NSR and SR and across training levels. A paired t test was used to determine differences in each individual feature for NSR versus SR. A two-sample F test was performed to assess differences between report types in the variance of the number of key features reported.

To assess the association between report type and accuracy, we used generalized linear mixed-effects models with readers and cases as random effects. The same analysis was repeated to investigate such effect for residents only, fellows only, and faculty only.

To evaluate the agreement between readers, we used the Fleiss’ Kappa coefficient. We used the bootstrap approach (10,000 bootstrap samples from cases) to estimate confidence intervals for the agreement and to determine whether the difference between the Fleiss’ Kappa coefficients for the SR and NSR was statistically significant. Coefficients were then categorized according to the following scale: 0.01–0.20 = slight agreement; 0.21–0.40 = fair agreement; 0.41–0.60 = moderate agreement; 0.61–0.80 = substantial agreement; 0.81–1.00 = almost perfect agreement.
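
The agreement analysis described above can be sketched in a few lines. This is a minimal illustration, not the study's code (the analysis was run in R): the function names `fleiss_kappa`, `bootstrap_ci`, and `agreement_label` are hypothetical, the percentile method for the interval is an assumption, and the label for coefficients below 0.01 is not part of the scale quoted above.

```python
import random


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of cases, each a list of labels (one per reader)."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for case in ratings for label in case})
    # Number of readers choosing each category, per case.
    counts = [[case.count(c) for c in categories] for case in ratings]
    # Observed agreement per case, then averaged over cases.
    p_i = [(sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_cases
    # Chance agreement from the overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_cases * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # only one category used: kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


def bootstrap_ci(ratings, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling cases with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(ratings) for _ in ratings]
        kappa = fleiss_kappa(sample)
        if kappa == kappa:  # drop undefined (NaN) resamples
            stats.append(kappa)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


def agreement_label(kappa):
    """Map a coefficient onto the scale quoted in the text."""
    if kappa < 0.01:
        return "none (below the quoted scale)"  # assumption, not in the source
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"
```

Here each case would be a list of nine entries (one per reader), for example 1 if the reader reported a fistula and 0 otherwise. Resampling whole cases, rather than individual readings, keeps the nine readings of a case together in every bootstrap sample.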

A χ² test was used to compare the referring clinician preference with regard to SR versus NSR. Descriptive statistics were calculated for the post-study survey of participating radiologists.

Statistical analysis was conducted using Microsoft Excel (Microsoft, Redmond, WA) and R version 3.2.1 (2015/06/18, www.R-project.org). P values of less than 0.05 were considered significant.

Results

Quantitative analysis

Interpretations using NSR documented the presence or absence of 8.2 ± 2.2 key features (range 4–14), while SR documented 14.6 ± 0.5 features (range 13–15) (p < 0.001). Increased reporting of both pertinent positive and pertinent negative features comprised this difference, with NSR documenting 5.1 ± 1.5 positive findings compared to 5.9 ± 1.7 for SR (p < 0.001), and 3.2 ± 1.8 negative findings for NSR compared to 8.7 ± 1.6 for SR (p < 0.001). There was also a decrease in the variability in the number of key features reported, with NSR having a standard deviation of 2.2 compared to 0.5 for SR (p < 0.001). When using NSR, there was a gradient from residents (7.5 ± 2.3 features) to faculty (9.1 ± 1.7 features), though this was not significant (p = 0.32). With SR, readers across training level reported approximately the same number of features (Fig. 2).
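
As a rough check on the summary figures above, the variance ratio behind the F test and the relative increases in reporting can be recomputed from the quoted means and standard deviations. This is only a sketch: the helper names are hypothetical, and no p value is computed because that requires the F distribution, which the Python standard library does not provide.

```python
def variance_ratio(sd_a, sd_b):
    """F statistic for two samples: larger variance divided by smaller variance."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)


def fold_change(before, after):
    """Relative change of a mean, expressed as a multiple of the starting value."""
    return after / before


# Standard deviations of the number of key features per report (NSR vs. SR).
f_stat = variance_ratio(2.2, 0.5)

# Each report type contributed 270 reports, so each sample has 269 degrees of freedom.
degrees_of_freedom = (270 - 1, 270 - 1)

# Mean key features per report, and mean pertinent negatives per report.
total_fold = fold_change(8.2, 14.6)
negative_fold = fold_change(3.2, 8.7)

print(f"F = {f_stat:.2f} on {degrees_of_freedom} degrees of freedom")
print(f"key features: x{total_fold:.2f}, pertinent negatives: x{negative_fold:.2f}")
```

The ratio treats the 270 reports of each type as independent samples, which is a simplification given that the same readers and cases appear in both groups.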

Fig. 2

Number of key features reported by report type and training level. All three levels of training showed a significant increase in key feature reporting when using the structured report

Multiple key features were consistently omitted from the NSR, and the use of SR resulted in increased documentation of many of these features. Specifically, 13 of 15 key features were documented significantly more in the SR including fluid collection and perianal disease. The remaining key features that were included more frequently in the SR are outlined in Table 1. Disease location was mentioned with similar frequency between report types. Only wall thickening was mentioned less frequently in SR, as this was not specifically included in the template.

Table 1 Key features described in non-structured versus structured reports

Accuracy was assessed for a subset of 5 key features (multifocal disease, stricture, fistula, fluid collection, and perianal disease) for both reporting styles (Table 2). In general, accuracy when using SR was not significantly improved compared to NSR. For reporting on fistula, fluid collection, and perianal disease, accuracy rates were not significantly different for all readers combined or within training levels. There was a small but statistically significant improvement in accuracy for describing multifocal disease, with accuracy increasing from 76% (205/270) with NSR to 83% (224/270) with SR (p = 0.01). This overall improvement was manifested by increased accuracy for faculty (84% (76/90) SR vs. 72% (65/90) NSR, p = 0.01) but not for fellows or residents. There was a trend toward increased accuracy for reporting on stricture (60% (163/270) SR vs. 54% (147/270) NSR, p = 0.06).
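
The accuracy percentages above follow directly from the quoted counts of correct interpretations. The short sketch below recomputes them; the dictionary layout and the function name `percent_accurate` are illustrative assumptions, and only counts stated in the text are used.

```python
# (feature, report type, reader group) -> (correct interpretations, total interpretations)
reported_counts = {
    ("multifocal disease", "NSR", "all readers"): (205, 270),
    ("multifocal disease", "SR", "all readers"): (224, 270),
    ("multifocal disease", "NSR", "faculty"): (65, 90),
    ("multifocal disease", "SR", "faculty"): (76, 90),
    ("stricture", "NSR", "all readers"): (147, 270),
    ("stricture", "SR", "all readers"): (163, 270),
}


def percent_accurate(correct, total):
    """Accuracy as a whole-number percentage."""
    return round(100 * correct / total)


for (feature, report_type, group), (correct, total) in reported_counts.items():
    print(f"{feature:20s} {report_type:3s} {group:12s} "
          f"{percent_accurate(correct, total)}% ({correct}/{total})")
```

The denominators reflect the design: 30 cases read by nine readers gives 270 interpretations per report type, and 30 cases read by three faculty gives 90.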

Table 2 Reporting accuracy between NSR and SR for a subset of key features, across training levels (numbers in percent accurate)

The analysis of interobserver variability for the subset of five features showed increased agreement using SR in some instances. For example, agreement was ‘fair’ for fistula in NSR (k = 0.21, 95% CI 0.10–0.31) and increased to ‘substantial’ for SR (k = 0.62, 95% CI 0.43–0.79) (p < 0.001). Agreement increased from ‘slight’ (k = 0.01, 95% CI −0.09 to 0.07) to ‘fair’ (k = 0.29, 95% CI −0.03 to 0.54) for perianal disease (p < 0.001), from ‘slight’ (k = 0.15, 95% CI 0.06–0.25) to ‘fair’ (k = 0.35, 95% CI 0.18–0.50) for stricture (p < 0.001), and from ‘slight’ (k = 0.04, 95% CI −0.03 to 0.14) to ‘moderate’ (k = 0.59, 95% CI −0.02 to 0.89) for fluid collection (p = 0.25). Agreement for multifocal disease was ‘substantial’ and did not change (k = 0.68 NSR vs. k = 0.70 SR, p = 0.59).

Qualitative analysis

The referring clinicians preferred SR with regard to ease of information extraction (mean score SR 1.7 vs. NSR 1.2 on a scale of 0–2, p < 0.01) (Table 3). For SR, the clinicians rated 35 of 45 reports (78%) as “easy” regarding ease of information extraction, compared to only 19 of 45 reports (42%) for NSR. There was also a small but significant preference for SR when trying to identify disease phenotype: clinicians rated 44 out of 45 (98%) SR as being helpful for phenotype identification, compared to 39 out of 45 (87%) NSR. The clinicians did not demonstrate a significant preference when evaluating reports for clarity of anatomy.
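
To illustrate how a χ² comparison of these ratings works, the sketch below applies a Pearson χ² test (1 degree of freedom, no continuity correction) to the two 2 × 2 tables that can be built from the counts above. This is an assumption-laden illustration rather than the study's analysis: the function name `chi_square_2x2` is hypothetical, the ease-of-extraction ratings are dichotomized here into “easy” versus “not easy” whereas the original test may have used all three levels, and the phenotype table has expected counts below 5, where an exact test is usually preferred.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square and p value for the table [[a, b], [c, d]]."""
    observed = ((a, b), (c, d))
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: SR, NSR. Columns: rated "easy", not rated "easy".
ease_chi2, ease_p = chi_square_2x2(35, 10, 19, 26)

# Rows: SR, NSR. Columns: phenotype identifiable, not identifiable.
phenotype_chi2, phenotype_p = chi_square_2x2(44, 1, 39, 6)

print(f"ease of extraction: chi2 = {ease_chi2:.2f}, p = {ease_p:.4f}")
print(f"disease phenotype:  chi2 = {phenotype_chi2:.2f}, p = {phenotype_p:.4f}")
```

Under these assumptions both comparisons fall below the 0.05 threshold, with the phenotype comparison only narrowly so, which is in line with the text's description of a small but significant preference.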

Table 3 Subjective assessment of reports by referring clinicians

Reader experience

When asked about their experience using SR, two of the nine readers (22%) rated it as “easy,” six (67%) rated it as “difficult but got better with practice,” and only one (11%) rated it as “consistently bothersome and fatiguing.” With regard to productivity for reporting CTE of IBD when using SR, four readers (44%) stated using SR would either “definitely” or “slightly” increase productivity, while the other five readers (56%) did not think their productivity would be affected. None of the readers felt SR would decrease their productivity. When asked about using SR for reporting CT of other disease processes, five readers (56%) stated that after completing this project they were “definitely” or “somewhat” more likely to use SR. Only one reader (11%) said that completing this project made him or her “somewhat less likely” to use SR in the future. Out of 36 total responses, only four (11%) were negative toward SR (Fig. 3).

Fig. 3

Reader responses to a survey regarding structured reporting; the majority of responses were neutral or positive

Discussion

SR is an advantageous reporting method to allow thorough documentation of complex disease processes such as IBD. The results of this study demonstrate that when using non-structured reports to describe the findings of inflammatory bowel disease on CTE, radiologists across all levels of training consistently omit numerous key descriptors and pertinent negative findings. However, when using structured reports, radiologists detailed nearly double the number of key features in a more consistent fashion, with both pertinent positive and negative findings reported more frequently. In addition, our referring physicians showed some subjective preference for SR. However, despite improved documentation of findings, overall accuracy and accuracy across training levels were minimally affected.

These data support the growing evidence that SR allows for increased reporting of key features of complex disease processes. Improved reporting using SR has already been described in multiphasic CT for pancreatic cancer [3] as well as in MRI for rectal cancer staging [25], both being complex diseases. In our study, increased reporting was manifest through higher numbers of both positive and negative findings. Radiologists generally identify positive findings regardless of report type; however, even a small increase in positive finding reporting could be clinically significant, and our radiologists reported nearly one full additional positive finding when using SR. On the other hand, reporting of pertinent negative findings is also important [26, 27], particularly when describing complex disease processes like IBD. In our study, there was a nearly threefold increase in the number of pertinent negative findings described when using SR. Features such as fluid collection, fistula, stricture, and perianal disease were all described significantly more often in SR, even when they were absent. This improvement suggests that unless prompted some pertinent negative findings may be omitted from a report, which could lead to confusion or a sense of incompleteness in the report.

While more features were reported overall, we did not find a significant impact on accuracy when examining a subset of key features including stricture, fistula, fluid collection, and perianal disease. The lower accuracy rate for stricture suggests that the diagnosis is challenging, and when coupled with high interobserver variability, these data highlight the subjectivity and difficulty of reporting stricture on CT. In contrast, multifocal disease was a feature where the use of SR was associated with greater accuracy, though the increase was modest. Our findings with regard to accuracy are similar to those in prior works: Powell et al. found no increase in accuracy when readers used SR to interpret maxillofacial CT [28], and Lin et al. described mixed results regarding accuracy with the use of SR for cervical spine CT [20]. It appears that while more descriptors may be provided with SR for CTE, accuracy may not improve.

When using NSR, there was a gradient across training levels while assessing the number of key features reported. As one might predict, residents reported the fewest key features and faculty the most. This gradient was eliminated when using SR, with all three training levels reporting a similar number of key features. These results are in contradistinction to Johnson et al., who found that residents had decreased completeness of reports when using SR to report on brain MRIs [11]. Nonetheless, increased reporting of key features for trainees can have positive effects. First, using a template that highlights important disease characteristics can enhance learning and help trainees develop an approach for a complex disease. Second, reporting became more consistent. Interestingly, accuracy was not improved among trainees using SR, and in general trainee accuracy was similar to faculty accuracy. Upper-level resident performance has been shown to be similar to that of faculty physicians for other tasks, and these data provide additional support [18].

Subjective analysis by referring physicians suggested a slight preference for SR. The referrers found it easier to extract information from SR and rated SR marginally higher with regard to identifying disease phenotype. The ability to understand anatomy was rated as high for both report types. These findings are similar to studies that have demonstrated preference for templated reports [9, 10, 29]. Despite these results, it is important to recognize that these are subjective measures and that attitudes may vary between physicians and across institutions. New templates should be reviewed with and tailored to specific groups of referrers. It is also important to note that the complexity of a given case may influence what report style is preferred. When presented with a report describing multifocal disease with certain aspects fluctuating over time, a templated report may prove challenging to read, whereas a relatively ‘negative’ report might be easy to read in a structured format. This point also applies to the interpreting radiologist, whose experience with SR or NSR may also differ depending on the complexity of a case.

Just as the referring physicians slightly preferred SR, the radiology readers in our study demonstrated some preference for SR. Overall, responses to subjective questions were largely positive or neutral toward SR, with no readers feeling that SR would decrease productivity. These results are in line with multiple studies suggesting radiologist preference for itemized reporting [29, 30].

While our data suggest that SR yields more key features with less variability, SR is not perfect. For example, wall thickening was reported less frequently in SR, possibly because it was not included as a discrete item on the template. Interestingly, wall thickness was documented in 88% of NSR, which means that radiologists using NSR often think about and subsequently describe the feature. The decrease using SR suggests that when using a template, interpreters may fail to comment on features not listed in front of them, features they might normally describe. It appears that the act of populating a standard report may change how readers report information, perhaps negatively in the case of this particular feature. As a result of these findings, we have added wall thickening to our institution’s template.

There are limitations to this study. First, a relatively small number of studies were included and the number of positive findings within the dataset for certain key features was low, which affects accuracy data. Next, the use of a template and its inclusion of various features helps prompt readers to report on key features, which may introduce bias toward more key feature reporting for SR. Recall bias could also affect results, since each reader reviewed each study twice; a minimum four-week break between reading sessions was used to minimize this bias. In addition, readers knew that their reports were part of a study rather than part of the daily clinical workflow; this too could affect reporting behavior. Another limitation is the reference standard used to establish report accuracy: while two experienced faculty radiologists helped establish the standard, they did not have any surgical or pathologic correlate for each individual feature. Finally, we did not record dictation times or any objective assessment of reader productivity. Future efforts could include studying readers and reports within a true clinical workflow to better assess the impact on productivity.

Based on our results and the current literature, we feel that for assessment of complex disease processes where there is potential for miscommunication between referrer and radiologist, structured formats could provide benefit. In fact, our institution has incorporated templated reporting of CTE for IBD into its daily practice.

In conclusion, structured reporting of CTE for inflammatory bowel disease improved documentation of key reporting features for trainees and faculty, though there was minimal impact on accuracy. Referring physicians subjectively preferred the structured reports, and radiologists using structured reports did not view them negatively.