Introduction

Crohn’s disease (CD) is a chronic, relapsing inflammatory disorder of the gastrointestinal tract, characterized by transmural inflammation and a protracted clinical course. Ileocecal resection (ICR) is a frequently required surgical intervention in the management of CD, especially when complications such as strictures, fistulas, or abscesses occur in the terminal ileum and cecum. Although surgery is effective in alleviating symptoms and improving quality of life, postoperative recurrence (POR) remains a formidable challenge, affecting a significant proportion of patients within the initial years following surgery [1,2,3].

Accurate and timely identification of postoperative recurrence is crucial for optimizing therapeutic strategies, preventing complications, and improving long-term outcomes [4].

Traditional CD management includes monitoring clinical symptoms using tools such as the Crohn’s Disease Activity Index, the Harvey-Bradshaw Index, or the Inflammatory Bowel Disease Questionnaire. However, clinical scores, despite their utility, tend to be subjective, and the presence of irritable bowel syndrome symptoms may complicate the accurate assessment of active inflammation [5]. Generally, ileocolonoscopy is considered the primary test for diagnosing POR in CD [6]. In this context, the Rutgeerts’ scoring (RS) system has been formulated, with scores of i2 (lesions confined to ileocolonic anastomosis) and higher indicating the presence of endoscopic recurrence [7]. Notably, a modified score of i2b (> 5 aphthous lesions with normal mucosa between the lesions) has gained recognition in recent years due to its higher correlation with surgical recurrence [8].

Non-invasive modalities have garnered increased attention recently, primarily due to challenges in conducting endoscopic examinations in post-surgical conditions where anatomical changes pose difficulties. Capsule endoscopy is among the favored methods, exhibiting 100% sensitivity and 69% specificity [9]. It excels in visualizing lesions in the proximal small bowel and demonstrates sensitivity in detecting early endoscopic recurrence [10,11,12]. Despite these advantages, limitations exist, such as the patency system, oral preparation challenges, difficulty in swallowing capsules, capsule retention, lack of extra-luminal information, and time-consuming aspects [13]. Intestinal ultrasound, with a pooled sensitivity and specificity of 94% and 84%, respectively [14], is gaining more attention, particularly due to its cost-effectiveness, time efficiency, lack of preparation requirements, and patient-friendly nature. However, it does have limitations in detecting early-stage conditions [13].

Computed Tomography Enterography (CTE) and Magnetic Resonance Enterography (MRE) have established roles in the diagnosis and management of CD [15, 16]. They can provide insights into segments of the small bowel that may not be visible during ileo-colonoscopy. Additionally, they assess extra-luminal complications and allow evaluation of the submucosa and serosa, contributing to the understanding of transmural healing [17, 18]. In a meta-analysis, the pooled sensitivity of CTE and MRE for diagnosing disease activity in CD was 85.8% and 87.9%, respectively, and the pooled specificity was 83.6% and 81.2%, respectively [19]. Both modalities are helpful in evaluating the small and large bowel [20, 21]. While a 2017 meta-analysis indicated a pooled sensitivity of 97% and specificity of 84% for MRE in POR of CD [9], the reliability of these findings is constrained by the limited number of studies and patients included. In this meta-analysis, we aimed to investigate the role of enterography techniques in evaluating postoperative recurrence of CD.

Materials and methods

This study was designed following the guidelines outlined in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Protocols (PRISMA-P) checklist [22].

Search methods

We systematically searched English-language medical literature databases (Scopus, Web of Science, PubMed) until December 2023 for original peer-reviewed articles related to Crohn's disease, using specific Boolean search terms. The search criteria included ("Crohn*" OR "Crohn's disease" OR "inflammatory bowel disease" OR "IBD") AND ("magnetic resonance enterography" OR "MR enterography" OR "MRE" OR "MR entero*" OR "MRI" OR "computed tomography enterography" OR "CT enterography" OR "CTE" OR "CT entero*") AND ("postoperative" OR "post-operative" OR "post operative" OR "post-surg*" OR "post surg*" OR anastom* OR resect* OR recurren*). Reference lists of identified articles were also reviewed. No publication date limit was applied.

Study selection

To be included in the analysis, eligible published original articles were required to meet the following criteria:

  1. Evaluate CTE or MRE as a diagnostic method for the detection of POR of CD.

  2. Utilize ileo-colonoscopy (RS ≥ i2) as the reference standard.

  3. Provide sufficient data for the extraction of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) results.

  4. Have a minimum time interval of 3 months between the operation and the diagnosis of recurrence.

  5. Maintain a reasonable time interval between ileo-colonoscopy and enterography.

The exclusion criteria encompassed abstracts lacking full articles, unpublished studies, notes, letters, comments, conference articles, and studies utilizing methods other than ileo-colonoscopy (IC) as the reference test. Studies that aggregated data for IBD as a whole, without discretely available CD data, were also excluded. Additionally, duplicated studies were excluded from consideration. Two authors (M. C. & S. Z.) independently reviewed the titles and abstracts to assess study eligibility. If both authors concurred, the study underwent a full-text review. Both authors reviewed all full-text articles during this phase, and if mutual agreement was reached, the study was selected. Any discrepancies were resolved through consensus or referred to a third reviewer (A. R.).

Data extraction

Two researchers independently conducted data extraction capturing the following information: title, abstract, authors' names, year of publication, nature of study (retrospective vs. prospective), details of the ground truth, and demographic features (including the number of patients, mean age, percentage of each gender, symptoms, and disease duration). Additionally, data extraction covered the imaging technique used, interpretation criteria, ileo-colonoscopy findings, and values of TP, TN, FP, and FN. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were also recorded.
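The performance measures listed above all follow from the four cells of a 2-by-2 table. The sketch below is illustrative only: the class name, function name, and counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of enterography results against the endoscopic reference."""
    tp: int  # enterography positive, reference positive
    fp: int  # enterography positive, reference negative
    fn: int  # enterography negative, reference positive
    tn: int  # enterography negative, reference negative


def diagnostic_metrics(t: TwoByTwo) -> dict:
    """Return the standard accuracy measures as proportions (0 to 1).

    Assumes every denominator is non-zero; no continuity correction is applied.
    """
    total = t.tp + t.fp + t.fn + t.tn
    return {
        "sensitivity": t.tp / (t.tp + t.fn),
        "specificity": t.tn / (t.tn + t.fp),
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        "accuracy": (t.tp + t.tn) / total,
    }


# Hypothetical counts, for illustration only.
example = TwoByTwo(tp=45, fp=10, fn=5, tn=40)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that PPV and NPV depend on the prevalence of recurrence in the sample, whereas sensitivity and specificity do not.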

Furthermore, imaging features and their performance in the diagnosis of recurrence were extracted. These features included mucosal enhancement, bowel wall thickening, anastomotic luminal dilatation, pre-anastomotic ileal dilatation, penetrating disease, comb sign, mesenteric edema, lymph nodes, fibrofatty proliferation, and the length of the disease. Any discordances in the extracted data were resolved through consensus or referral to the third reviewer.

Assessment of methodological quality

Study quality was assessed by the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) [23]. Each question was assigned a response of high-risk, low-risk or unclear.

Statistical analysis

Continuous variables were reported as means and categorical variables as percentages. We used RS ≥ i2b as the reference threshold whenever it was available; otherwise, RS ≥ i2 was used. Based on the extracted 2-by-2 contingency tables, a random-effects method was used to pool the diagnostic performance measures, including sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV). A bivariate model was fitted to obtain the summary points for sensitivity and specificity and their 95% confidence intervals (CI), accounting for within- and between-study heterogeneity. Each imaging feature was pooled if at least 3 papers provided the required data. For these features, both diagnostic performance (sensitivity and specificity) and risk ratio (RR) were analyzed. Heterogeneity was assessed with Higgins' I² statistic and Cochran's Q test, and the results were interpreted following the Cochrane guideline [24]. In diagnostic test accuracy studies, an essential factor in heterogeneity screening is the threshold effect. This effect was assessed by calculating the correlation between sensitivity and the false-positive rate, with r ≥ 0.6 considered significant [25].
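To make the pooling and heterogeneity statistics concrete, the following is a simplified sketch, not the bivariate model used in this study. It pools a single proportion (for example, sensitivity) on the logit scale with the DerSimonian-Laird random-effects estimator and reports Cochran's Q and I². The function name, the 0.5 continuity correction applied to every study, and the input counts are all assumptions made for illustration.

```python
import math


def pool_logit_random_effects(events, totals, z=1.96):
    """DerSimonian-Laird pooling of proportions on the logit scale.

    events[i] / totals[i] is the proportion in study i (e.g. TP / (TP + FN)).
    A 0.5 continuity correction is added to every cell (an assumption of this
    sketch) so that studies with 0% or 100% remain usable.
    """
    y, v = [], []
    for e, n in zip(events, totals):
        a, b = e + 0.5, n - e + 0.5
        y.append(math.log(a / b))   # logit of the corrected proportion
        v.append(1 / a + 1 / b)     # approximate within-study variance

    w = [1 / vi for vi in v]        # fixed-effect (inverse-variance) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw

    # Cochran's Q and Higgins' I² (in percent).
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Between-study variance (tau²), truncated at zero.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights and pooled estimate.
    w_re = [1 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    def expit(x):
        return 1 / (1 + math.exp(-x))

    return {
        "pooled": expit(mu),
        "ci_low": expit(mu - z * se),
        "ci_high": expit(mu + z * se),
        "q": q,
        "i2": i2,
        "tau2": tau2,
    }


# Hypothetical true-positive counts and numbers of reference-positive patients.
result = pool_logit_random_effects(events=[18, 25, 40], totals=[20, 30, 42])
print({k: round(val, 3) for k, val in result.items()})
```

A bivariate model additionally estimates the correlation between logit-sensitivity and logit-specificity across studies, which this univariate sketch ignores.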

All analyses were conducted with the "meta", "metafor", and "mada" packages in R statistical analysis software (version 4.2.1, R Foundation for Statistical Computing, Vienna, Austria).

Results

Study characteristics

The initial search yielded 747 studies. After removing duplicates, 504 studies underwent title/abstract review, resulting in the exclusion of 376 studies at this stage. During the full-text review of the remaining 42 articles, 32 were deemed ineligible based on the aforementioned selection criteria. Ultimately, 11 studies, comprising 11 populations and 589 patients (4 CTE and 7 MRE populations with 248 and 341 patients, respectively), were included in the meta-analytic calculations [26,27,28,29,30,31,32,33,34,35,36]. Notably, in the study by Bachour et al., two separate datasets for CTE and MRE were investigated independently [27]. In one study, crude data for diagnostic performance were not provided, but detailed information on imaging features was available and was used for the subsequent feature analysis [36]. Figure 1 illustrates the flow diagram outlining the inclusion process, while Table 1 provides a comprehensive overview of the characteristics of the included studies.

Fig. 1 Flowchart of the study selection process for the systematic review

Table 1 Characteristics of included studies

Diagnostic performance

The overall sensitivity and specificity of enterography exams were 91.1% (95% CI: 84.8–94.9%) and 74.7% (95% CI: 56.4–87.1%), respectively (Figs. 2 and 3). The NPV and PPV were 86.3% (95% CI: 80.0–90.1%) and 81.7% (95% CI: 67.7–90.5%), respectively. Interstudy heterogeneity was low for sensitivity (I² = 29%, P-value = 0.17), indicating relatively consistent results across studies. However, for specificity, significant heterogeneity was observed (I² = 85%, P-value < 0.01), suggesting variability in study outcomes. The Spearman correlation between sensitivity and specificity was -0.213 (95% CI: -0.721 to 0.444, P-value = 0.53). Although specificity tended to decrease as sensitivity increased, the correlation was weak and not statistically significant, indicating that a threshold effect is not a significant concern in this analysis.
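The threshold-effect check can be illustrated with a rank correlation between per-study sensitivity and false-positive rate (1 minus specificity). The sketch below uses only the Python standard library; the helper names and the per-study values are hypothetical, and the 0.6 cut-off is the one stated in the methods.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-study estimates, for illustration only.
sensitivity = [0.90, 0.95, 0.85, 1.00]
specificity = [0.75, 0.25, 0.50, 0.875]
false_positive_rate = [1 - s for s in specificity]

rho = spearman(sensitivity, false_positive_rate)
threshold_effect = rho >= 0.6  # cut-off stated in the methods
print(round(rho, 3), threshold_effect)
```

Because the false-positive rate is 1 minus specificity, the correlation of sensitivity with the false-positive rate is the negative of its correlation with specificity. A strongly negative sensitivity-specificity correlation would therefore be the pattern that signals a threshold effect.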

Fig. 2 Forest plot for pooled sensitivity of the studies. Horizontal lines represent 95% CIs of the individual studies

Fig. 3 Forest plot for pooled specificity of the studies. Horizontal lines represent 95% CIs of the individual studies

MRE

Seven studies, encompassing a total population of 341 patients, were included in the analysis for MRE. The sensitivity and specificity were 90.4% (95% CI: 78.1–96.1%) and 77.5% (95% CI: 56.7–90.1%), respectively. The NPV and PPV were 88.6% (95% CI: 79.0–94.1%) and 78.9% (95% CI: 62.2–89.4%), respectively. Interstudy heterogeneity was low for sensitivity (I² = 24%, P-value = 0.25), indicating relatively consistent results across studies. Nevertheless, for specificity, significant heterogeneity was observed (I² = 79%, P-value < 0.01).

CTE

Four studies involving a total population of 248 patients were included. The sensitivity and specificity were 92.6% (95% CI: 86.9–96.0%) and 68.6% (95% CI: 34.6–90.1%), respectively. The NPV and PPV were 82.5% (95% CI: 70.4–90.3%) and 85.6% (95% CI: 57.0–96.4%), respectively. While no interstudy heterogeneity was observed for sensitivity (I² = 0%, P-value = 0.56), a substantial degree of heterogeneity was noted for specificity (I² = 88%, P-value < 0.01).

Imaging features

Imaging features such as lymph nodes, fibrofatty proliferation, and disease length fell short of the minimum study count for inclusion in the meta-analysis.

Table 2 provides a detailed analysis of the pooled imaging features. All included features were significantly more frequent in POR patients; the highest RR was for mucosal enhancement (2.33, 95% CI: 1.83–2.98). The highest sensitivity was observed for wall thickening (81.2%, 95% CI: 61.8–91.8%), while the highest specificity was found for penetrating lesions (97.1%, 95% CI: 89.9–99.2%). Perivisceral edema, comb sign, and pre-anastomosis dilatation were also highly specific features, each with more than 90% specificity. Detailed information on each feature is provided in the online supplementary file.
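As a worked illustration of how a risk ratio and its confidence interval are obtained from a single study, the sketch below assumes the RR compares the risk of recurrence when a feature is present with the risk when it is absent, using the usual log-scale standard error. The function name and the counts are hypothetical, and per-study RRs would still need to be pooled across studies.

```python
import math


def risk_ratio(events_present, total_present, events_absent, total_absent, z=1.96):
    """Risk ratio of recurrence (feature present vs absent) with a Wald CI.

    Assumes all four counts are positive; no continuity correction is applied.
    """
    risk_present = events_present / total_present
    risk_absent = events_absent / total_absent
    rr = risk_present / risk_absent

    # Standard error of log(RR).
    se_log = math.sqrt(
        1 / events_present - 1 / total_present
        + 1 / events_absent - 1 / total_absent
    )
    low = math.exp(math.log(rr) - z * se_log)
    high = math.exp(math.log(rr) + z * se_log)
    return rr, low, high


# Hypothetical counts: recurrence in 30 of 40 patients with the feature
# and in 15 of 50 patients without it.
rr, low, high = risk_ratio(30, 40, 15, 50)
print(f"RR = {rr:.2f} (95% CI: {low:.2f} to {high:.2f})")
```

An interval that excludes 1 indicates that the feature is significantly associated with recurrence in that study.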

Table 2 Diagnostic value and risk ratio of each imaging feature

Wall thickening was consistently defined as > 3 mm thickness across all studies. However, regarding other features, minor discrepancies were noted among the studies. In terms of contrast enhancement, Soyer et al. utilized a semiquantitative grading system (0–2) [36], while Baillet et al. employed a criterion of relative contrast enhancement > 100% [28]. The definition of mucosal enhancement was not clearly outlined in other studies, possibly contributing to the high heterogeneity observed in the pooled analysis of this feature.

The definition of anastomosis narrowing varied among the studies. Bachour et al. defined it as luminal narrowing > 50% of the normal luminal diameter, while Pozassero and Soyer used an anastomosis diameter of ≤ 12 mm. Regarding pre-anastomosis dilatation, Bachour et al. set the criterion at diameter > 2.5 cm, while Pozassero and Soyer used diameter > 3 cm [27, 34, 36].

Schaefer did not mention criteria for pre-anastomosis dilatation, wall thickening, and anastomosis narrowing [35].

Heterogeneity and quality assessment

Heterogeneity results for each analysis are available in the online supplementary figures. Some heterogeneity may be attributed to minor differences in reference standard cut-offs, as newer papers adopted a modified RS ≥ i2b compared to the traditional RS ≥ i2 for defining POR. Additionally, despite efforts to establish criteria, the lack of discrete imaging criteria for differentiating recurrence cases introduces subjectivity, relying on reviewers' experience. Heterogeneity is further influenced by variations in the evaluated population, especially in cases with borderline or early-stage disease, where imaging modalities may be indefinite. Addressing these factors is crucial for a comprehensive understanding of the observed heterogeneity in the study results.

Details of the quality assessment are available in Fig. 4, revealing an overall low risk of bias among the included papers. In patient selection, only one paper posed a high risk due to enrolling patients with a high risk of recurrence. Additionally, the head-to-head data of IC and CTE were partially available for some patients in one study, lacking detailed information, leading to uncertainty about the risk of bias in this aspect. Concerns were raised about the potential impact of unclear methodologies on the index test, while the reference standard and flow and timing demonstrated low risk across all studies.

Fig. 4 QUADAS-2 checklist result for the internal validation of the included studies

Discussion

In this meta-analysis, our findings indicate that enterography techniques, particularly MRE, exhibit high sensitivity and acceptable specificity for diagnosing POR in patients with CD. Additionally, through an analysis of available imaging features, we identified penetrating lesion, perivisceral edema, comb sign, and pre-anastomosis dilation as the most specific signs. Conversely, wall thickening and mucosal enhancement emerged as the most sensitive features.

While ileo-colonoscopy is widely regarded as the gold standard for detecting POR, its invasive nature can be unpleasant for patients. The growing demand for less invasive methods has underscored the importance of imaging modalities, positioning them as valuable screening tools. A screening modality should exhibit high sensitivity and NPV in order to minimize the need for invasive procedures and enhance the overall patient experience. This is a crucial step in applying the "treat-to-target" strategy in CD surveillance. In this regard, enterography emerges as a suitable and patient-friendly choice.

The sensitivity exceeded 80% in all studies except one. In that study, the authors employed the MaRIA score for POR diagnosis. Their findings suggested that a score greater than 3.76 yielded the best accuracy, demonstrating 61% sensitivity and 82% specificity [28]. However, it should be noted that the MaRIA score was not originally designed for POR assessment.

Our results reveal a low specificity of enterography, indirectly indicating a higher number of FP cases compared to the gold standard test. However, this issue is debated. Generally, ileo-colonoscopy within one year of ileocolonic resection is considered the gold standard for POR diagnosis. Despite lacking formal validation, the RS has gained widespread acceptance in clinical practice and is routinely employed in clinical trial settings. However, doubts persist. This scoring system, initially developed in 1990, focuses on lesions identified at the neo-terminal ileum and the ileocolonic anastomosis, specifically in patients who have undergone ileocolonic end-to-end anastomosis [3]. Later, it was proposed that patients with multiple lesions in the neo-terminal ileum are more prone to POR than those with mild lesions at the anastomotic site; therefore, the modified score of i2b was considered more important [37]. This was reinforced by further studies emphasizing the clinical significance of the modified score [38,39,40,41]. Conversely, other studies showed that this distinction does not necessarily affect the recurrence rate during follow-up [42, 43]. In addition, the score shows limited inter-observer reproducibility [44,45,46] and has not been clearly studied in other types of anastomoses. Given these dilemmas, recent trials are actively seeking improved criteria for predicting postoperative recurrence [40, 47]. Nevertheless, the issue remains a subject of ongoing debate and scrutiny.

Bachour et al., with the lowest specificity among the included studies, proposed that the occurrence of FP cases could be attributed to active proximal ileal or colon lesions, which are not encompassed in the RS. They attribute this observation to the inherent capability of cross-sectional imaging to identify lesions beyond the mucosal layer [27]. Similarly, Schaefer et al. associated the low specificity with the limitations and pitfalls of the RS [35].

A critical challenge in reading enterography images is the lack of a well-defined set of imaging criteria for the diagnosis of POR. The absence of standardized criteria is noteworthy, potentially contributing to heterogeneity across studies. The significance of this issue becomes apparent when examining the results, particularly regarding the diverse approaches to imaging criteria. For example, early studies utilized MR scoring systems where scores 1–3 were classified as morphological recurrence. Two studies that used this scoring system reported 100% sensitivity, but specificity was notably low, reaching 40% in one of them [26, 31]. Recent publications predominantly favor subjective assessments, using a combination of multiple imaging features. Recently, Schaefer et al. introduced the MONITOR index as a tool for predicting POR through MRE [35]. The index comprises seven imaging features: wall thickening, contrast enhancement, T2 signal increase, DWI signal increase, length of disease ≥ 20 mm, edema, and ulcer. Each feature scores 1 if positive, except for the presence of an ulcer, which scores 2.5. The authors determined that the optimal cut-off for sensitivity is a score greater than 1, yielding 79% sensitivity and 55% specificity [35]. While this marks a valuable stride in the standardization of interpretation, additional studies are needed to thoroughly assess the performance and efficacy of the index.
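The scoring rule described above can be written as a short sketch. It is based solely on the description in this paragraph (seven features, 1 point each, 2.5 points for an ulcer, positive if the score is greater than 1). The feature keys and function names are hypothetical, and the original publication [35] should be consulted for the exact feature definitions.

```python
# Weights as described in the text: 1 point per positive feature,
# 2.5 points for an ulcer. Key names are illustrative, not official.
MONITOR_WEIGHTS = {
    "wall_thickening": 1.0,
    "contrast_enhancement": 1.0,
    "t2_signal_increase": 1.0,
    "dwi_signal_increase": 1.0,
    "disease_length_ge_20mm": 1.0,
    "edema": 1.0,
    "ulcer": 2.5,
}


def monitor_score(findings: dict) -> float:
    """Sum the weights of all features reported as present (True)."""
    unknown = set(findings) - set(MONITOR_WEIGHTS)
    if unknown:
        raise ValueError(f"Unknown feature(s): {sorted(unknown)}")
    return sum(
        weight
        for name, weight in MONITOR_WEIGHTS.items()
        if findings.get(name, False)
    )


def suggests_recurrence(findings: dict, cutoff: float = 1.0) -> bool:
    """Apply the 'score greater than 1' cut-off reported in the text."""
    return monitor_score(findings) > cutoff


# Hypothetical readings, for illustration only.
only_thickening = {"wall_thickening": True}
ulcer_only = {"ulcer": True}
print(monitor_score(only_thickening), suggests_recurrence(only_thickening))
print(monitor_score(ulcer_only), suggests_recurrence(ulcer_only))
```

Under this rule, a single non-ulcer feature does not exceed the cut-off, whereas an ulcer alone or any two features do.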

To establish a valid index, it is crucial to identify the most important imaging features. Our investigation revealed that the key findings include mucosal enhancement, wall thickening, and penetrating lesions. Mucosal enhancement exhibited the highest risk ratio, with acceptable sensitivity and specificity. Wall thickening emerged as the most sensitive finding, while penetrating lesions were the most specific. This is concordant with the findings of Schaefer et al., in which ulcer was the most significant observation [35]. However, penetrating lesions have very low sensitivity, which is a limitation in the screening setting.

Wall thickening is often regarded as the initial indicator to consider during image interpretation; however, its specificity is compromised by potential overlap with fibrosis cases [36]. Pozassero et al. found 82% sensitivity and 75% specificity (total 82% accuracy) in the diagnosis of anastomotic recurrence compared to fibro-stricture [34]. Soyer et al. identified stratification and the comb sign as the two most discriminative independent features for distinguishing between recurrence and fibro-stricture in CTE [36]. To enhance differentiation, attention to additional imaging features becomes imperative, especially in MRE. Notably, high T2 intensity and restriction on DWI sequences have proven to be valuable findings [48,49,50,51]. Both of these features were incorporated into the MONITOR index introduced by Schaefer et al. [35]. It is crucial to highlight that DWI is not a standard component of the MRE protocol. Despite being highly valuable in the context of CD, its specific role in the context of POR is not extensively studied [52]. Djelouah et al. aimed to elucidate the value of DWI in the POR context and demonstrated that the addition of DWI to contrast-enhanced MRE could marginally enhance sensitivity without altering specificity.

Beyond the analyzed features there are other findings that could be of value. Length of the disease was considered important, although the cut-offs were different [27, 35]. The evaluation of mesenteric lymph nodes has also diverged among studies, with some employing size, utilizing different cutoffs as the measure of analysis, while others have focused on the presence of enhancement [34,35,36]. To gain a deeper understanding of the issue at hand, further comprehensive studies are essential.

Regarding the comparison of CTE and MRE, our analysis leans toward favoring MRE. This preference holds significance as patients with CD often require multiple follow-up imaging sessions, necessitating a modality that is both convenient and safe. The sole study directly comparing the performance of these two modalities in the same population, conducted by Bachour et al., indicated that while CTE exhibited slightly higher sensitivity, its specificity was notably lower than that of MRE [27]. This finding further underscores the value of MRE in this context. Additionally, it has been proposed that in the surveillance of CD patients, MRE demonstrates higher accuracy and inter-reader agreement, emphasizing its potential superiority in long-term monitoring [53].

This study is subject to several limitations, with the most significant being the heterogeneity observed among the included patients. Various factors contribute to this heterogeneity, including the duration of the disease, the duration and type of pre- and post-surgical medical treatment, the severity of disease activity, the type of surgical anastomosis, and the presence of concomitant complications. Another source of heterogeneity lies in the variability of image acquisition protocols across the studies. Furthermore, the interpretation of images was predominantly subjective in nature among the included studies. These limitations underscore the need for caution when generalizing the findings and highlight the importance of future studies addressing these confounding factors for a more nuanced understanding of the topic.

In conclusion, our study demonstrates that both MRE and CTE exhibit high sensitivity and acceptable specificity, with MRE showing particular promise, for the detection of POR in CD. This positions them as effective initial screening tools, potentially allowing for the reserved use of ileo-colonoscopy in cases where enterography results are inconclusive. However, it is imperative for future studies to concentrate on identifying the most valuable imaging features and strive toward standardizing the interpretation of imaging results.