Introduction

Autoimmune pancreatitis (AIP) is a distinct type of chronic pancreatitis characterized by periductal infiltration by IgG4-positive plasma cells leading to periductal and interlobular fibrosis [1]. AIP can be categorized as diffuse, focal, or multifocal according to the involvement pattern [2]. Differentiating AIP from pancreatic ductal adenocarcinoma (PDAC) is challenging for two reasons. First, 28–48% of AIP cases manifest as a focal pancreatic mass with pancreatic duct stricture, an appearance that is also seen in PDAC [3,4,5,6]. Second, AIP and PDAC share epidemiologic and clinical features, such as a preponderance in elderly men and presentation with painless jaundice [7]. Because the treatment strategies and prognoses of AIP and PDAC differ [8, 9], accurate differential diagnosis is critical to prevent unnecessary surgical resection in patients with AIP; 3–9% of patients who underwent resection for presumed PDAC have been reported to actually have AIP [10].

Imaging tests including computed tomography (CT) and magnetic resonance imaging (MRI) have been widely used in the diagnosis and management of patients with AIP. The characteristic imaging findings play an important role in the diagnosis of AIP in most classification systems, including the Japanese consensus guidelines [11], the Mayo clinic HISORt criteria [12], and the International Association of Pancreatology guidelines [13]. Various CT and MRI imaging features are suggested to be valuable for the differential diagnosis between AIP and PDAC, including the morphology of the pancreatic mass, pattern of the pancreatic duct stricture, and enhancement pattern [3,4,5,6, 8, 9, 12, 14,15,16,17,18,19,20,21,22,23,24].

Previous studies have reported the diagnostic performance of CT and MRI for differentiating AIP from PDAC using these imaging features, but the reported values vary widely, i.e., 18–94% for sensitivity and 85–100% for specificity [8, 9, 12, 16, 18, 21, 23, 25,26,27,28]. In addition, because only one study [9] directly compared CT and MRI for differentiating AIP from PDAC, there is limited information on which imaging test performs better. Considering that MRI has higher soft-tissue contrast than CT, which is useful for detecting focal pancreatic masses [9], we hypothesized that MRI might have better diagnostic performance for differentiating AIP from PDAC.

Therefore, we aimed to systematically determine the diagnostic performance of CT and MRI for differentiating AIP from PDAC with a meta-analytic comparison between the two tests.

Materials and methods

This meta-analysis followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines for conducting and reporting a study. The analysis was performed using methods advocated by the Diagnostic Test Accuracy Working Group of the Cochrane Collaboration and the Agency for Healthcare Research and Quality (AHRQ) [29].

Literature search strategy

A comprehensive literature search of PubMed and EMBASE databases was conducted. The query terms were designed for a sensitive literature search and included the search terms of “Pancreas,” “Pancreatitis,” “Pancreatic neoplasm,” “MRI,” “CT,” and “Diagnosis.” The detailed search terms are listed in Supplementary Table 1. Because of the recent fast evolution of imaging techniques, the search included only articles published between January 1, 2009, and December 31, 2019. The studies were limited to English language articles and human patients.

Eligibility criteria

Duplicate articles were removed and the eligibility of articles was assessed by one author according to the following criteria: (a) patients: patients with AIP or PDAC; (b) index test: MRI with or without contrast enhancement; (c) comparison: CT with or without contrast enhancement; (d) outcomes: diagnostic accuracy for differentiating AIP from PDAC; (e) study design: observational studies (retrospective or prospective) and clinical trials; and (f) reference standard: pathological or clinical diagnosis. The exclusion criteria were as follows: (a) studies with duplicate patients and data; (b) studies without sufficient information to create a 2 × 2 diagnostic table; (c) review articles, comments, letters, editorials, case reports, and conference articles; and (d) studies not in the field of interest. Two reviewers (with 9 and 5 years of experience in abdominal radiology) screened the abstracts and titles of the retrieved articles and reviewed the full text of potentially eligible articles. Both review sessions were conducted independently, and articles with definite ineligibility were excluded. Articles with discordant eligibility were reviewed in a consensus meeting with a third reviewer (who had 13 years of experience in abdominal radiology).

Data extraction

The following data were extracted from eligible articles: (a) study characteristics (authors, publication date, study period, study design, and study type); (b) study subject characteristics (total number of patients, age, and the number of patients with the underlying disease); (c) lesion characteristics (total number of lesions, the number of patients with the underlying disease, AIP type [focal or diffuse], AIP diagnostic criteria); (d) CT techniques (use of contrast-enhanced images, multi-phase enhanced images, slice thickness, and number of channels); (e) MRI techniques (magnetic field strength, use of contrast-enhanced images, magnetic resonance pancreatography [MRP], and diffusion-weighted imaging [DWI]); (f) the method of image review (single reviewer, multiple independent reviewers, or multiple reviewers in consensus); (g) analyzed imaging features; (h) the clarity of blinding to the reference standard during image review; and (i) the reference standard (pathologic diagnosis or clinical diagnosis). Regarding the reference standard, the number of cases determined by pathologic diagnosis and clinical diagnosis was extracted, and the specific criteria for AIP and the method of pathologic diagnosis were investigated. To determine diagnostic accuracy, the numbers of true positives, false positives, false negatives, and true negatives were extracted. When a study reported multiple sets of results, one for each individual imaging feature, the result set with the highest Youden index value was used. The two reviewers performed data extraction independently, and all disagreements were re-evaluated in a consensus meeting with a third reviewer.
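The rule of keeping the result set with the highest Youden index can be sketched as follows. This is a minimal illustration rather than the extraction procedure actually used: the feature names and 2 × 2 counts are hypothetical, and the Youden index is taken in its usual form, J = sensitivity + specificity − 1.

```python
from typing import NamedTuple


class ResultSet(NamedTuple):
    """One 2 x 2 table reported for a single imaging feature (hypothetical)."""
    feature: str
    tp: int  # AIP patients read as AIP on imaging
    fp: int  # PDAC patients read as AIP on imaging
    fn: int  # AIP patients read as PDAC on imaging
    tn: int  # PDAC patients read as PDAC on imaging


def youden_index(r: ResultSet) -> float:
    """Youden index J = sensitivity + specificity - 1."""
    sensitivity = r.tp / (r.tp + r.fn)
    specificity = r.tn / (r.tn + r.fp)
    return sensitivity + specificity - 1.0


def select_result_set(results: list[ResultSet]) -> ResultSet:
    """Return the result set with the highest Youden index."""
    return max(results, key=youden_index)


# Hypothetical counts for illustration only; not data from any included study.
candidates = [
    ResultSet("delayed enhancement", tp=18, fp=3, fn=7, tn=47),
    ResultSet("capsule-like rim", tp=10, fp=0, fn=15, tn=50),
    ResultSet("duct-penetrating sign", tp=20, fp=5, fn=5, tn=45),
]
best = select_result_set(candidates)
print(best.feature, round(youden_index(best), 3))
```

Because the Youden index weights sensitivity and specificity equally, a feature with perfect specificity but low sensitivity (such as the hypothetical capsule-like rim above) is not selected over a more balanced feature.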

Assessment of study quality

Two reviewers independently assessed the quality of the eligible articles using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) criteria [30]. Four different domains were assessed: patient selection, index test, reference standard, and flow of patients through the study.

Data synthesis and statistical analysis

Accuracy of CT and MRI for differentiating AIP from PDAC

To investigate the effect of the type of imaging test on the diagnostic accuracy, the results of all articles were segregated and analyzed as separate studies according to the type of imaging test (CT vs. MRI). All reported results were analyzed on a per-patient basis. The sensitivity and specificity for differentiating AIP from PDAC and their 95% confidence intervals (CIs) were determined using the relevant extracted data from each individual study. Sensitivity was defined as those patients showing AIP lesions on imaging tests divided by all patients diagnosed as AIP, while specificity was defined as those patients showing PDAC lesions on imaging tests divided by all patients diagnosed as PDAC. The meta-analytic summary sensitivity and specificity were calculated for both CT and MRI using a hierarchical modeling method, i.e., a bivariate random effects model. The hierarchical summary receiver operating characteristic (HSROC) model was used to acquire the summary receiver operating characteristic curve [29]. The diagnostic performance for differentiating AIP from PDAC was compared between CT and MRI using meta-regression analytic methods. In addition, a subgroup analysis was performed for the differentiation of focal AIP from PDAC.
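The per-study definitions of sensitivity and specificity given above can be illustrated with a short sketch. The counts are hypothetical, and the Wilson score interval is an assumption made for this example only; the text does not state which interval method was used for the individual-study 95% CIs.

```python
import math


def wilson_ci(successes: int, total: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a proportion (assumed CI method, z for 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def sensitivity_specificity(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Per-patient sensitivity and specificity with 95% CIs.

    Sensitivity: patients read as AIP on imaging / all patients diagnosed as AIP.
    Specificity: patients read as PDAC on imaging / all patients diagnosed as PDAC.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "sensitivity_ci": wilson_ci(tp, tp + fn),
        "specificity": tn / (tn + fp),
        "specificity_ci": wilson_ci(tn, tn + fp),
    }


# Hypothetical study: 25 patients with AIP and 50 patients with PDAC.
result = sensitivity_specificity(tp=21, fp=2, fn=4, tn=48)
for name, value in result.items():
    print(name, value)
```

These per-study estimates are only the inputs to the pooled analysis; the bivariate random effects and HSROC models are not reproduced here.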

The Higgins I2 statistic was used to assess heterogeneity in the sensitivity and specificity of the included studies. When substantial study heterogeneity (I2 > 50%) was present, the presence of a threshold effect was evaluated by visual assessment of coupled forest plots of sensitivity and specificity. In addition, the Spearman correlation coefficient between the sensitivity and false-positive (FP) rate was calculated, with a correlation coefficient > 0.6 being considered to indicate a considerable threshold effect.
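The two heterogeneity checks can be sketched as below. This is a simplified illustration under stated assumptions, not the Stata routine used in the analysis: I2 is computed from Cochran's Q on logit-transformed sensitivities with inverse-variance weights, the Spearman coefficient is computed as the Pearson correlation of average ranks, and all counts are hypothetical.

```python
import math


def higgins_i2(estimates: list[float], variances: list[float]) -> float:
    """Higgins I^2 (%) = max(0, (Q - df) / Q) * 100, with Q from Cochran's test."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    return 0.0 if q <= 0 else max(0.0, (q - df) / q) * 100.0


def average_ranks(values: list[float]) -> list[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when all values are tied")
    return cov / (sx * sy)


# Hypothetical (tp, fp, fn, tn) counts for five studies; illustration only.
studies = [(21, 2, 4, 48), (10, 1, 13, 60), (30, 5, 6, 95), (8, 1, 10, 40), (45, 3, 5, 120)]
logit_sens = [math.log(tp / fn) for tp, fp, fn, tn in studies]
var_logit = [1.0 / tp + 1.0 / fn for tp, fp, fn, tn in studies]
i2 = higgins_i2(logit_sens, var_logit)

sens = [tp / (tp + fn) for tp, fp, fn, tn in studies]
fp_rate = [fp / (fp + tn) for tp, fp, fn, tn in studies]
rho = spearman_rho(sens, fp_rate)
print(f"I2 = {i2:.0f}%, Spearman rho = {rho:.2f}, threshold effect: {rho > 0.6}")
```

A strongly positive correlation between sensitivity and FP rate suggests that studies trade specificity for sensitivity by using different positivity thresholds, which is why the 0.6 cut-off is applied to this coefficient.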

Meta-regression analysis

Meta-regression was conducted to explore the causes of heterogeneity between the studies. The covariates used in the meta-regression included (a) patient composition (AIP > PDAC vs. PDAC > AIP); (b) publication year (before 2015 vs. after 2015); (c) CT slice thickness (≤ 3 mm vs. > 3 mm); (d) the use of multi-phase enhanced CT (multi-phase vs. single-phase); (e) MR magnet strength (1.5 T only vs. others); (f) the use of MRP (with MRP vs. without MRP); (g) the use of DWI (with DWI vs. without DWI); (h) image review (multiple independent reviewers vs. single reviewer or multiple reviewers with consensus); and (i) the clarity of blinding to the reference standard during review (clear vs. unclear).

Analysis of publication bias

Publication bias was determined by visual assessment of a Deeks’ funnel plot, and statistical significance was evaluated by Deeks’ asymmetry test.
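The quantities behind a Deeks' funnel plot can be sketched as follows. This is a simplified illustration, not the Stata implementation used in the analysis: the log diagnostic odds ratio (lnDOR) of each study is regressed on 1/√ESS, weighted by the effective sample size (ESS), and the slope is tested for departure from zero. The counts are hypothetical, and adding 0.5 to every cell when a cell is zero is an assumed continuity correction.

```python
import math


def effective_sample_size(n_aip: int, n_pdac: int) -> float:
    """ESS = 4 * n1 * n2 / (n1 + n2)."""
    return 4.0 * n_aip * n_pdac / (n_aip + n_pdac)


def log_dor(tp: float, fp: float, fn: float, tn: float) -> float:
    """Log diagnostic odds ratio; adds 0.5 to all cells if any is zero (assumed)."""
    cells = [tp, fp, fn, tn]
    if 0 in cells:
        tp, fp, fn, tn = (c + 0.5 for c in cells)
    return math.log((tp * tn) / (fp * fn))


def weighted_regression(x, y, w):
    """Weighted least squares of y on x; returns (slope, intercept, slope SE)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum(wi * (yi - intercept - slope * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    slope_se = math.sqrt(rss / (len(x) - 2) / sxx)
    return slope, intercept, slope_se


# Hypothetical (tp, fp, fn, tn) counts; not data from the included studies.
studies = [(21, 2, 4, 48), (10, 0, 13, 60), (30, 5, 6, 95), (8, 1, 10, 40), (45, 3, 5, 120)]
ess = [effective_sample_size(tp + fn, fp + tn) for tp, fp, fn, tn in studies]
x = [1.0 / math.sqrt(e) for e in ess]   # horizontal axis of the regression
y = [log_dor(*s) for s in studies]      # lnDOR of each study
slope, intercept, slope_se = weighted_regression(x, y, ess)
t_stat = slope / slope_se
print(f"slope = {slope:.2f}, SE = {slope_se:.2f}, t = {t_stat:.2f}")
```

The t statistic would be compared against a t distribution with k − 2 degrees of freedom (k being the number of studies) to obtain the p value; that step is omitted here to keep the sketch free of external dependencies.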

All statistical evaluations were conducted using Stata version 15.0 (StataCorp LP), with p < 0.05 being taken to indicate statistical significance.

Results

Literature search

Overall, 856 articles were retrieved after elimination of duplicate articles. Of these potentially eligible articles, 80 were initially identified as possibly being of interest according to their titles and abstracts, and then, a further 69 articles were excluded after full text evaluation. A schematic diagram of the study inclusion process is presented in Fig. 1. Among the remaining 11 articles [8, 9, 12, 16,17,18, 21, 23, 26,27,28], five reported the diagnostic accuracy of CT [8, 12, 18, 27, 28], four reported the diagnostic accuracy of MRI [16, 17, 21, 23], and two reported the diagnostic accuracy of both CT and MRI [9, 26].

Fig. 1 Flow diagram of the study selection process for the systematic review and meta-analysis

The characteristics of the included articles are listed in Table 1. All included studies were retrospective case-control studies. Regarding the type of AIP, five studies included only focal AIP [16, 21, 26,27,28], whereas six included both diffuse and focal AIP [8, 9, 12, 17, 18, 23]. Of the seven studies evaluating CT, one used single-phase contrast-enhanced images [12], whereas six used multi-phase contrast-enhanced images [8, 9, 18, 26,27,28]. Four studies used thin slices (≤ 3 mm) [9, 18, 27, 28], whereas the others used a slice thickness > 3 mm or did not report the slice thickness [8, 12, 26]. Of the six studies evaluating MRI, one used both 1.5-T and 3-T MRI [16], whereas five used only 1.5-T MRI [9, 17, 21, 23, 26]. One study did not report the magnetic field strength of the scanner [26]. Five studies used MRP [9, 16, 21, 23, 26], and three studies used DWI [9, 16, 17].

Table 1 Characteristics of the included articles

For diagnosing AIP, 10 studies used clinical diagnostic criteria [8, 9, 12, 16,17,18, 21, 23, 26, 28], and one study used pathology as a reference standard for AIP [27] (Supplementary Table 2). Of the 10 studies using clinical diagnostic criteria, three studies used one diagnostic criterion, and the remaining seven studies used a combination of diagnostic criteria, including the Mayo clinic HISORt criteria, Japanese diagnostic criteria, International Consensus Diagnostic Criteria, Asian diagnostic criteria, and Korean diagnostic criteria. For diagnosing PDAC, nine studies used pathology only [8, 9, 16,17,18, 21, 23, 27, 28], and two studies used both clinical diagnosis and pathology as a reference standard for PDAC [12, 26]. Of the nine studies using pathology only, two studies used surgical specimens, and the remaining seven studies used the pathology of both surgical specimens and biopsies.

The detailed imaging features for differentiating AIP from PDAC in 11 studies are summarized in Supplementary Table 3. Nine studies reported the diagnostic performance of each individual imaging feature [12, 16,17,18, 21, 23, 26,27,28], but two studies reported the diagnostic performance of CT or MRI considering multiple imaging features together using a 3-point or 5-point scale system [8, 9].

Study quality according to QUADAS-2

The quality of the 11 included articles is summarized in Fig. 2. The method of patient selection and the clarity of blinding to the reference standard during review were notable areas of quality concern. Ten studies did not include consecutive patients, and all the included studies were of a case-control design. All of the included studies were unclear about blinding to the results of the index test when interpreting the reference standard, and also about blinding to the results of the reference standard while interpreting the index test. Furthermore, a risk of bias was noted in the flow and timing criteria, with ten studies showing uncertainty as to whether the time interval between the index test and reference standard was appropriate.

Fig. 2 Quality assessment of the studies according to QUADAS-2 criteria. The methodological quality of the articles is presented as the proportion of articles (0–100%) with low (i.e., high quality), high, or unclear risk of bias, and the proportion of articles with low (i.e., high quality), high, or unclear concerns regarding the applicability of each domain

Diagnostic performance of CT and MRI for differentiating AIP from PDAC

Both the sensitivities (17–86%) and specificities (85–100%) of the individual CT studies (806 patients in seven studies) showed considerable variability (Table 2). The meta-analytic summary sensitivity and specificity of CT were 59% (95% CI, 41–75%) and 99% (95% CI, 88–100%), respectively. Substantial study heterogeneity was found in both sensitivity and specificity (I2 = 88% and 81%, respectively); however, the coupled forest plot of sensitivity and specificity did not show a threshold effect (Fig. 3a), and the Spearman correlation between sensitivity and FP rate was 0.8 (p = 0.12), indicating a positive correlation without statistical significance. The HSROC curve revealed a large difference between the 95% confidence and prediction regions, thereby indicating considerable heterogeneity across the studies (Fig. 3b).

Table 2 Diagnostic performance of CT and MRI for differentiating AIP from PDAC

Fig. 3 Coupled forest plots and HSROC curve of the sensitivity and specificity for the differential diagnosis of AIP from PDAC on CT (a, b) and MRI (c, d)

For MRI (485 patients in six studies), both sensitivity and specificity showed considerable variability across individual studies (44–94% for sensitivity and 77–100% for specificity), with meta-analytic summary sensitivity and specificity of 84% (95% CI, 68–93%) and 97% (95% CI, 87–99%), respectively (Table 2). Substantial study heterogeneity was found in both sensitivity and specificity (I2 = 75% and 81%, respectively), but the coupled forest plot of sensitivity and specificity did not show a threshold effect (Fig. 3c), and the Spearman correlation between sensitivity and FP rate was 0.79 (p = 0.06), indicating a positive correlation without statistical significance (Fig. 3d).

When we compared the diagnostic performance for differentiating AIP from PDAC between CT and MRI, MRI had a significantly higher meta-analytic summary sensitivity than CT (84% vs. 59%, p = 0.02) but a similar specificity to CT (97% vs. 99%, p = 0.18) (Fig. 4).

Fig. 4 A 62-year-old man with diffuse AIP (author’s own collection). a, b CT shows diffuse parenchymal swelling on the arterial-phase image and a low density halo (arrow) surrounding the pancreas on the delayed-phase image. c, d MRI shows T2 hyperintensity parenchymal change with delayed enhancing rim (arrow)

Subgroup analysis for differentiating focal AIP from PDAC

The meta-analytic summary sensitivities and specificities for differentiating focal AIP from PDAC were 50% (95% CI, 16–85%) and 98% (95% CI, 93–100%), respectively, for CT, and 76% (95% CI, 54–99%) and 97% (95% CI, 91–100%) for MRI, with MRI showing a higher meta-analytic summary sensitivity than CT, although the difference was not statistically significant (76% vs. 50% respectively, p = 0.28) (Fig. 5). Both MRI and CT had similar specificities (97% vs. 98% respectively, p = 0.07).

Fig. 5 A 54-year-old man with focal AIP (author’s own collection). a, b CT shows a 2.0-cm ill-defined low density lesion (arrow) in the pancreatic head portion with upstream pancreatic duct dilatation. c, d This lesion appears more conspicuous on MRI (arrow) than on CT, demonstrating pancreatic ductal dilatation and stricture more clearly. e MRCP shows multifocal stricture and dilatation with the duct penetrating sign (visible duct within a mass) (arrow)

Meta-regression analysis

The meta-regression analysis results are summarized in Table 3. For both CT and MRI, the year of publication was a significant factor associated with study heterogeneity (p ≤ 0.05). Studies published after 2015 showed a higher sensitivity than those published before 2015 for both CT (83% vs. 49%) and MRI (93% vs. 74%). In the CT studies, slice thickness, the method of image review, and the clarity of blinding during the review were also significantly associated with study heterogeneity (p ≤ 0.03). Studies with a CT slice thickness ≤ 3 mm had a higher sensitivity than those with a slice thickness > 3 mm (72% vs. 42%), but the specificities were similar (93% vs. 100%). In addition, studies with multiple independent reviewers and studies with clear blinding during the review had higher sensitivities than those with single or multiple reviewers with consensus, and those with unclear blinding during the review.

Table 3 Results of the meta-regression analysis

Significant publication bias was noted for MRI (p = 0.02) but not for CT (p = 0.28; Supplementary Fig. 1).

Discussion

This study showed very high meta-analytic summary specificities for both CT (99%) and MRI (97%) for the differentiation of AIP from PDAC. However, the meta-analytic summary sensitivity of CT was 59%, whereas that of MRI was 84%. Compared with CT, MRI had a significantly higher sensitivity for differentiating AIP from PDAC (p = 0.02) but a similar specificity (p = 0.18).

The higher sensitivity of MRI found in this study is in agreement with a recent study that reported MRI as having a higher sensitivity than CT (88.5–90.2% vs. 77–80.3%, p ≤ 0.07) [9]. Although this previous study did not show a statistically significant difference between CT and MRI, our meta-analysis showed MRI to have a significantly higher sensitivity than CT in the diagnosis of AIP. The higher sensitivity of MRI might be due to the higher soft-tissue contrast of MRI, which enables the detection of focal pancreatic masses [9]. Although the detection of a subtle mass or small lesion might be difficult on CT because of the poor soft-tissue contrast between focal mass–like AIP lesions and the normal pancreatic parenchyma, MRI including fat-suppressed pre-contrast T1-weighted images can demonstrate even small focal pancreatic masses with a high lesion conspicuity [31]. Given previous findings that multiplicity of AIP was more frequently observed on MRI than on CT (33–44% vs. 6%), the higher sensitivity of MRI for detecting AIP in comparison with CT seems reasonable [8, 16, 32].

Although there is wide heterogeneity among the multiple proposed criteria for AIP [11,12,13], which is likely due to regional and ethnic differences in the pathologic and clinical manifestations of AIP [7], they commonly include imaging evidence as a key feature, i.e., diffuse or focal enlargement of the pancreas with a diffuse or segmental pancreatic duct stricture. Most guidelines recommend CT, MRI, or endoscopic retrograde pancreatography for the evaluation of AIP; however, there is a lack of information about the choice of imaging tests. In this study, MRI had a significantly higher sensitivity than CT for detecting AIP. In addition, previous studies reported that MRI can be more useful than CT for evaluating pancreatic duct stricture [5, 9, 33], with reported detection rates for pancreatic duct stricture of 85.2% for MRI and 54.1% for CT [9]. This is because MRI using heavily T2-weighted cross-sectional imaging or MRP can illustrate minimal or mild pancreatic duct strictures with absent or minimal upstream pancreatic duct dilatation, whereas CT has limitations when it comes to showing subtle pancreatic duct abnormalities [9]. In the differential diagnosis between AIP and PDAC, MRI may be more useful than CT, considering its higher detection of multiple lesions and better ductal imaging. Although the differential diagnosis of focal AIP from PDAC is challenging, combining these imaging features with serum IgG4 levels, serum tumor markers, and responsiveness to steroid therapy would be helpful to differentiate between the two diseases in clinical practice [7, 9].

Of the 11 included studies, Naitoh et al reported a low sensitivity of 17% for CT [26]. Because this study evaluated only three CT findings (mass size, delayed enhancement, and capsule-like rim), its sensitivity might have been limited. In addition, Hur et al reported a low sensitivity of 44% for MRI [16]. Given that sensitivity was lower for focal AIP than for diffuse AIP, the low sensitivity in this study might be explained by its inclusion of only patients with focal AIP. Furthermore, both studies reported the diagnostic performance of each CT or MRI feature separately, and the reported data varied according to the individual feature. Therefore, further studies using combined imaging features will be needed to determine the diagnostic performance of CT or MRI.

Publication year was a significant factor affecting study heterogeneity in both CT and MRI. As technical advances such as thin-slice images and three-dimensional imaging lead to improvements in the diagnostic performance of imaging tests [34], diagnostic performance may differ according to publication year. In addition, given that the diagnostic test accuracy is based upon a comparison between the results of the index test and those of the reference standard, knowledge of the reference standard may influence the interpretation of index test results [35], and the clarity of blinding during review might be a factor influencing study heterogeneity.

This study has several limitations. First, substantial study heterogeneity was noted in both the sensitivity and specificity of CT and MRI. Therefore, to explore the causes of study heterogeneity, we evaluated the presence of a threshold effect between sensitivity and specificity, and performed meta-regression analysis. Second, we could not show a statistically significant difference between CT and MRI in the subgroup analyses for differentiating focal AIP from PDAC, even though MRI had a significantly higher sensitivity than CT in all eligible studies. This may be due to the small number of studies with focal AIP only (three studies for CT and four studies for MRI). Further studies are needed to verify that MRI has a higher sensitivity than CT in the differentiation of focal AIP from PDAC.

In conclusion, both CT and MRI had very good diagnostic performance for the diagnosis of PDAC, but both showed only moderate diagnostic performance for AIP. Compared with CT, MRI had a significantly higher sensitivity for the detection of AIP lesions. Therefore, MRI may be more useful for evaluating patients with AIP in the clinic, especially for the differentiation of AIP from PDAC.