Introduction

Inflammatory pancreatic masses (IPM) primarily encompass mass-forming pancreatitis (MFP) and autoimmune pancreatitis (AIP). The imaging appearance of both entities mimics that of pancreatic cancer (PC) [1, 2]. However, PC is a highly fatal disease with an extremely poor prognosis and the lowest 5-year relative survival rate among all cancers [3]. Surgical resection remains the only curative treatment for PC. In contrast, surgery is not the optimal treatment option for patients with IPM. Given that postoperative morbidity rates remain high, it is essential to accurately differentiate between IPM and PC to avoid unnecessary surgery [4]. Magnetic resonance (MR) imaging has the advantages of no radiation exposure and high soft tissue contrast. Quantitative MR parameters can serve as non-invasive imaging biomarkers to distinguish malignant pancreatic neoplasms from inflammatory masses.

Several quantitative MR imaging biomarkers have been widely used in clinical practice, but only a select few have been investigated in limited cohorts of patients with pancreatic diseases [5,6,7,8]. Magnetic resonance cholangiopancreatography (MRCP) can evaluate the pancreatic duct system, and the maximal diameter of the upstream main pancreatic duct (dMPD) has been used as an effective imaging biomarker in the diagnosis of AIP [9, 10]. Diffusion-weighted imaging (DWI) is a technique that noninvasively assesses the Brownian motion of water molecules, and the apparent diffusion coefficient (ADC) value reflects the degree of diffusion restriction [11]. Notably, the findings regarding the ADC value, which is the most commonly used quantitative MR imaging biomarker, have been controversial, and recent reviews have also pointed to this issue [12, 13]. However, we observed that certain earlier studies conflated the two IPM entities, namely MFP and AIP, which exhibit distinct pathological characteristics and appear to differ in their ADC values [14,15,16,17]. Intravoxel incoherent motion (IVIM) imaging is an extension of DWI with multiple b values that can provide quantitative information on both molecular diffusion and microcirculatory perfusion [18, 19]. The IVIM-derived parameters are as follows: the slow component of diffusion (Dslow) represents pure molecular diffusion, the fast component of diffusion (Dfast) represents incoherent microcirculation, and the perfusion fraction (f) is the fraction of diffusion related to microcirculation [11]. Moreover, magnetic resonance elastography (MRE) has emerged as a useful modality for quantitative assessment of mass stiffness [20].
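For orientation, the DWI- and IVIM-derived parameters described above are commonly tied to the measured signal through the following simplified models. This is a sketch of the widely used textbook forms, not necessarily the exact formulations applied in the included studies; the symbols S(b), S_0, b_1, and b_2 are introduced here only for illustration.

```latex
% Mono-exponential model: ADC estimated from signals at two b values b_1 < b_2
\mathrm{ADC} = \frac{\ln\!\left[ S(b_1) / S(b_2) \right]}{b_2 - b_1}

% Simplified bi-exponential IVIM model
\frac{S(b)}{S_0} = f \, e^{-b \, D_{\mathrm{fast}}} + (1 - f) \, e^{-b \, D_{\mathrm{slow}}}
```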

To the best of our knowledge, the above-mentioned imaging biomarkers have not been meta-analyzed to distinguish between IPM and PC.

Therefore, this systematic review and meta-analysis aimed to determine the usefulness of quantitative MR imaging biomarkers for distinguishing between IPM and PC.

Materials and methods

This meta-analysis complied with the Preferred Reporting Items for Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) guidelines [21]. The requirement for approval from the institutional review board was waived.

Literature search

We searched the PubMed, Embase, Cochrane Library, and Web of Science databases through August 2, 2023, restricting the search to English-language publications. The search strategy consisted of (“pancreatic” OR “pancreas”) AND (“cancer*” OR “carcinoma*” OR “malignan*” OR “neoplasm*” OR “adenocarcinoma*”) AND (“Pancreatitis”) AND (“Magnetic Resonance Imaging” OR “magnetic resonance” OR “MRI” OR “MRCP” OR “MRE” OR “diffusion-weighted” OR “DWI”) AND (“Diagnosis, Differential” OR “Sensitivity and Specificity” OR “Predictive Value of Tests” OR “diagnostic performance” OR “differentia*” OR “sensitiv*” OR “specificit*” OR “accurac*” OR (“predictive” AND “value”)). The detailed search strategy is presented in Supplementary Table 1. The reference lists of the relevant studies were also checked manually to identify additional eligible studies.

Study selection

After removing duplicates, two reviewers screened the titles and abstracts to remove irrelevant studies. Next, the full-text manuscripts were retrieved. In this study, we strictly distinguished between the following concepts for clarity: MFP is a disease based on chronic pancreatitis (CP) with no background of AIP; AIP is a distinct disease with no background of MFP; and IPM represents either one or a combination of the two. Studies that fulfilled the predefined inclusion and exclusion criteria were assessed as eligible. The inclusion criteria were as follows: (a) patients with IPM or PC, (b) quantitative MR imaging biomarkers, (c) diagnostic performance for distinguishing between IPM and PC, and (d) observational studies (retrospective or prospective) or clinical trials. The exclusion criteria were as follows: (a) insufficient data, defined as the inability to construct a 2 × 2 contingency table or fewer than two studies for the same MR imaging biomarker; (b) fewer than five patients; (c) mixed patient groups for the diagnostic accuracy assessment of ADC, Dslow, and Dfast; (d) outliers, defined as studies whose cutoff direction was opposite to that of all the other articles with similar or larger patient numbers; (e) other types of articles, such as case reports, review articles, comments, letters, conference abstracts, editorials, or articles published in non-English languages; and (f) articles not available. Disagreements between the reviewers were resolved through discussion to reach consensus.

Data extraction

Data were extracted independently by two reviewers using a standardized form. Discrepancies were resolved by double-checking and discussion to reach consensus. The predesigned data extraction table covered four categories of information: (a) study characteristics: the first author’s name, publication year, country, key conclusions, study design, reference standard, interval between imaging and reference standards, and blinding to the reference standard; (b) patient characteristics: enrollment, number of patients or lesions, age, disease type, and AIP subtype; (c) MRI parameters: MRI unit, sequence, magnetic field strength, slice thickness, field of view, region of interest (ROI) size, number of b values, b values (s/mm²), repetition time (TR) (ms), and echo time (TE) (ms); (d) outcomes: true positive, false positive, false negative, and true negative numbers and the cutoff values used for differential diagnosis. If the ADC values were based on multiple b values, the raw data of the ADC value with the highest sensitivity and specificity were extracted. If there were several readers for the same MR imaging biomarker, the sensitivity and specificity obtained by the reader with the highest accuracy were extracted. In addition, the original authors were contacted in the case of insufficient information.
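As a minimal sketch of how a 2 × 2 contingency table relates to the extracted outcomes, the following Python snippet derives sensitivity and specificity from the four counts and, conversely, rebuilds the counts from reported rates and group sizes. The class and function names are hypothetical, the example numbers are invented, and IPM is treated as the "positive" condition purely as an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from one study; IPM is assumed to be the 'positive' condition."""
    tp: int
    fp: int
    fn: int
    tn: int

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)


def table_from_rates(sens: float, spec: float, n_pos: int, n_neg: int) -> TwoByTwo:
    """Rebuild counts from reported sensitivity/specificity and group sizes."""
    tp = round(sens * n_pos)
    tn = round(spec * n_neg)
    return TwoByTwo(tp=tp, fp=n_neg - tn, fn=n_pos - tp, tn=tn)


# Hypothetical study: 20 IPM and 40 PC lesions, sensitivity 0.80, specificity 0.85.
example = table_from_rates(0.80, 0.85, n_pos=20, n_neg=40)
print(example, example.sensitivity(), example.specificity())
```

When a study reports only rounded rates, the reconstructed counts may be ambiguous, which is one reason a table can be judged impossible to construct.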

Quality assessment

Quality assessment was performed using Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) [22]. Two reviewers independently assessed the risk of bias and applicability concerns in the following four domains: patient selection, index test, reference standard, and flow and timing. Disagreements were resolved through discussions to reach consensus. We did not assign scores to judge the quality of the included studies because each domain of QUADAS-2 had a different impact on the study results. Quality assessment forms were plotted using RevMan version 5.4.

Statistical analysis

The DerSimonian-Laird method (random effects model) was used for the meta-analysis. When a 2 × 2 table contained a cell with a zero count, a continuity correction of 0.5 was added to each cell. Forest plots were developed to visualize the pooled sensitivity and specificity with 95% confidence intervals (CIs). The statistical heterogeneity of the included studies was assessed using the Cochran Q test and the Higgins I² statistic. A Cochran Q test p value < 0.05 indicated the presence of statistical heterogeneity. An I² value between 20 and 50% indicated moderate heterogeneity, whereas an I² value of > 50% indicated substantial heterogeneity. Moreover, Spearman’s rank correlation test was conducted to assess the presence of threshold effects. Because fewer than 10 studies were available for each biomarker, the assessment of publication bias was considered unnecessary. Univariate meta-regression analyses were performed using predesigned subgroups to determine potential factors influencing heterogeneity. All statistical analyses were conducted by a single trained researcher (Z.H.W.) using Meta-DiSc version 1.4 and Stata version 15.0.
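The pooling and heterogeneity statistics named above can be sketched in a few lines of Python. This is an illustrative implementation of DerSimonian-Laird pooling on logit-transformed proportions, not a reproduction of the Meta-DiSc or Stata routines used in this study; the function names are hypothetical, the input counts are invented, and the 0.5 correction is applied per proportion rather than per full 2 × 2 table.

```python
import math


def logit_with_variance(events, total):
    """Logit of a proportion and its approximate variance.

    Adds 0.5 to both cells when either is zero (continuity correction).
    """
    non_events = total - events
    if events == 0 or non_events == 0:
        events, non_events = events + 0.5, non_events + 0.5
    return math.log(events / non_events), 1 / events + 1 / non_events


def dersimonian_laird(effects, variances):
    """Return pooled effect, its standard error, Cochran Q, I2 (%), and tau2."""
    w = [1 / v for v in variances]
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, se, q, i2, tau2


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


# Hypothetical (true positives, diseased total) pairs, not data from the included studies.
studies = [(16, 20), (27, 30), (12, 18), (25, 25)]
logits, variances = zip(*(logit_with_variance(tp, n) for tp, n in studies))
pooled, se, q, i2, tau2 = dersimonian_laird(logits, variances)
low, high = inv_logit(pooled - 1.96 * se), inv_logit(pooled + 1.96 * se)
print(f"pooled sensitivity {inv_logit(pooled):.2f} "
      f"(95% CI {low:.2f}-{high:.2f}), Q = {q:.2f}, I2 = {i2:.1f}%")
```

Pooling on the logit scale keeps the back-transformed confidence limits between 0 and 1, which is why it is used in this sketch.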

Results

Literature search and study selection

A flowchart of the literature screening process is shown in Fig. 1. A total of 1819 studies were initially retrieved. After removing 685 duplicates, a further 1030 studies were excluded because of obvious irrelevance. Next, 64 studies were excluded by the full-text review: four articles were not available; 13 articles did not involve a direct comparison of PC and any kind of IPM; eight articles did not report the diagnostic performance of quantitative MR imaging biomarkers; and 39 were other types of articles, as specified in the “Materials and methods” section. A further 15 articles were excluded due to insufficient data: 14 did not allow construction of 2 × 2 contingency tables, and one reported the maximal diameter of the upstream main pancreatic duct (dMPD) with a cutoff value other than 4 mm or 5 mm for the differential diagnosis of AIP and PC [23]. Twenty-five studies were initially included for data extraction. After comparing the direction of the cutoff values for MR imaging biomarkers (higher defined as IPM or lower defined as IPM), one study whose ADC cutoff direction deviated clearly from that of all the other studies was further excluded to ensure the stability of the results [24]. Twenty-four studies were included in the final meta-analysis [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48].
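The itemized counts in this paragraph can be cross-checked with a short tally. The variable names below are illustrative; the numbers are the itemized counts reported in the text.

```python
# Itemized counts as reported in the text; names are illustrative only.
retrieved = 1819
duplicates = 685
excluded_on_title_abstract = 1030
full_text_exclusions = {
    "not available": 4,
    "no direct comparison of PC and IPM": 13,
    "no diagnostic performance reported": 8,
    "other article types": 39,
    "insufficient data": 15,
}
outlier_studies = 1

full_text_reviewed = retrieved - duplicates - excluded_on_title_abstract
initially_included = full_text_reviewed - sum(full_text_exclusions.values())
final_included = initially_included - outlier_studies

print(full_text_reviewed, initially_included, final_included)  # 104 25 24
```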

Fig. 1 Flowchart of the study selection process for the meta-analysis

During study selection and full-text review, we noticed an apparent paradox in the results concerning the differentiation between IPM and PC using ADC values. Specifically, for the differential diagnosis of MFP and PC, all studies diagnosed MFP with an ADC value higher than the cutoff value, whereas for the differential diagnosis of AIP and PC, all studies diagnosed AIP with an ADC value lower than the cutoff value. Accordingly, the two IPMs were analyzed separately for the ADC values.

Data extraction and quality assessment

The characteristics of the included studies are summarized in Table 1. The study of Klauss et al [49] reported three sensitivities and specificities based on different b values, and we extracted the one with the highest sensitivity and specificity, which was obtained at b = 200 s/mm². Figures 2 and 3 show the results of the quality assessment of the included studies using QUADAS-2. Among the included studies, 18 were retrospective and six were prospective. For patient selection, eight studies were marked as having a high risk of bias because of their case-control design. In the 18 studies reporting the diagnostic efficacy of the ADC, f, or mass stiffness, the cutoff values were all determined from the receiver operating characteristic (ROC) curve, either at the point of the largest Youden index or at the point balancing sensitivity and specificity; these studies were therefore marked as having an unclear risk of bias in the domain of the index test. For reference standards, 16 studies were marked as having an unclear risk of bias because they did not state whether the researchers were blinded to the results of the index tests when applying the reference standard. Five studies did not include all patients in the final analysis for various reasons (such as suboptimal imaging quality, imaging failure, incomplete imaging workup, and complications of other diseases) and were marked as high risk in the domain of flow and timing. Furthermore, nine studies were labeled as having an unclear risk of bias because the intervals between the index tests and the reference standards were not reported. There were no concerns regarding the applicability of the included studies.

Table 1 Characteristics of initially included studies

Fig. 2 Methodological quality assessment graph of included studies using QUADAS-2: the risk of bias and applicability concerns about each domain presented as percentages across included studies

Fig. 3 Methodological quality assessment summary of included studies using QUADAS-2: the risk of bias and applicability concerns about each domain presented as low, unclear, or high risk

Diagnostic performance of MR imaging biomarkers

The diagnostic performances of the quantitative MR imaging biomarkers are shown in Table 2. The results of using ADC values to differentiate IPM from PC were controversial, but we noticed that the MFP vs. PC and AIP vs. PC groups had opposite cutoff directions. The mean or median ADC values of the IPM and PC in individual studies are shown in Fig. 4. In the MFP vs. PC group, masses exhibiting higher ADC values were identified as MFP, whereas in the AIP vs. PC group, masses with lower ADC values were diagnosed as AIP. This distinction underscores the importance of not conflating these two types of IPM in DWI assessment. Hence, the diagnostic efficacy of the ADC values was pooled separately for the two IPM subgroups.

Table 2 Diagnostic performance of quantitative MR imaging biomarkers for differentiating IPM from PC

Fig. 4 Mean or median ADC value in individual studies

In the differentiation between MFP and PC, the ADC value had a sensitivity of 0.80 (95%CI 0.61–0.92) and a specificity of 0.85 (95%CI 0.77–0.92). For distinguishing AIP and PC, the ADC value had a sensitivity of 0.82 (95%CI 0.75–0.87) and a specificity of 0.84 (95%CI 0.81–0.87). Additionally, for distinguishing IPM and PC, both f and mass stiffness showed relatively high sensitivities (f: 0.82, 95%CI 0.68–0.91; mass stiffness: 0.82, 95%CI 0.71–0.91) with moderate specificities (f: 0.68, 95%CI 0.61–0.74; mass stiffness: 0.77, 95%CI 0.68–0.84). For distinguishing AIP from PC, dMPD ≤ 5 mm demonstrated the highest sensitivity (0.97, 95%CI 0.93–0.99) and diagnostic odds ratio (46, 95%CI 17–126), but its specificity was relatively low (0.52, 95%CI 0.47–0.58). Thus, dMPD > 5 mm may be considered an ideal criterion for excluding AIP.
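To illustrate why a highly sensitive but poorly specific criterion is better suited to ruling a diagnosis out than ruling it in, the sketch below converts the pooled point estimates for dMPD ≤ 5 mm into likelihood ratios. The function names and the 30% pre-test probability are assumptions for illustration. The diagnostic odds ratio computed here from the two pooled point estimates (about 35) need not match the separately pooled value of 46 reported above.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios and the diagnostic odds ratio."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg, lr_pos / lr_neg


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a probability through odds using a likelihood ratio."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Pooled point estimates for dMPD <= 5 mm (AIP vs. PC) from the text.
lr_pos, lr_neg, dor = likelihood_ratios(0.97, 0.52)

# Hypothetical pre-test probability of AIP of 30% (an assumption for illustration).
p_aip_if_dilated = post_test_probability(0.30, lr_neg)
print(f"LR+ {lr_pos:.2f}, LR- {lr_neg:.3f}, DOR {dor:.1f}, "
      f"P(AIP | dMPD > 5 mm) {p_aip_if_dilated:.3f}")
```

A negative likelihood ratio of roughly 0.06 means that a duct wider than 5 mm sharply lowers the probability of AIP, whereas a positive likelihood ratio of about 2 shows that a narrow duct alone adds little confidence.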

Forest plots of the pooled sensitivity and specificity are shown in Supplementary Figures 1–3. For the diagnostic performance of all imaging biomarkers, the Cochran Q test p values indicated the presence of heterogeneity, and the I² values showed high heterogeneity among most studies. The lowest heterogeneity was found in the sensitivity of f and the specificity of mass stiffness for distinguishing IPM and PC (p = 0.391, I² = 4.1%), as well as in the specificity of dMPD ≤ 4 mm and the sensitivity of dMPD ≤ 5 mm for distinguishing AIP and PC. The Spearman correlation coefficients (Supplementary Table 2) revealed a considerable threshold effect for distinguishing IPM and PC using mass stiffness and for differentiating MFP from PC using ADC values. No significant threshold effects were found for the diagnostic performance of the other MR imaging biomarkers. Furthermore, univariate meta-regression analysis was conducted to identify other potential sources of heterogeneity.
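A threshold effect is commonly screened for by correlating the logit of the true positive rate with the logit of the false positive rate across studies; a strong positive Spearman correlation suggests that studies trade sensitivity against specificity by using different cutoffs. The sketch below implements that check with the standard library only. The per-study pairs are hypothetical, and the helper names are not taken from Meta-DiSc.

```python
import math


def ranks(values):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def logit(p):
    return math.log(p / (1 - p))


# Hypothetical per-study (sensitivity, specificity) pairs, not from the included studies.
pairs = [(0.95, 0.60), (0.90, 0.70), (0.80, 0.85), (0.70, 0.92)]
logit_tpr = [logit(se) for se, _ in pairs]
logit_fpr = [logit(1 - sp) for _, sp in pairs]
rho = spearman(logit_tpr, logit_fpr)
print(f"Spearman rho = {rho:.2f}")
```

In this invented example, sensitivity falls as specificity rises in every study, so the correlation is perfectly positive, the pattern expected when cutoffs differ between studies.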

Univariate meta-regression analysis

Univariate meta-regression analysis was performed to identify the potential factors influencing heterogeneity among the studies, and the results are shown in Table 3. For distinguishing AIP and PC, the sensitivity of ADC values obtained at high field strength (3.0T or mixed) (0.85, 95%CI 0.77–0.93) was higher than that obtained at low field strength (1.5T) (0.75, 95%CI 0.61–0.89). ADC values calculated from two b values showed higher sensitivity (0.86, 95%CI 0.80–0.93) than those calculated from multiple b values (0.66, 95%CI 0.50–0.83), but no significant heterogeneity was observed for the specificity. The pooled specificity of dMPD ≤ 5 mm was higher in studies conducted in Japan (0.86, 95%CI 0.72–1.00) than in studies conducted in other countries (0.45, 95%CI 0.24–0.64), whereas the sensitivity did not exhibit significant heterogeneity. Owing to the limited number of studies, we did not conduct a subgroup analysis of the other quantitative MR imaging biomarkers.

Table 3 Results of univariate meta-regression: sources of heterogeneity

Discussion

In this review, relevant studies on quantitative MR imaging biomarkers for distinguishing between IPM and PC were systematically reviewed. Thereafter, we pooled the diagnostic performance of four quantitative MR imaging biomarkers for distinguishing between IPM and PC: ADC, f, mass stiffness, and dMPD. All these biomarkers had relatively high diagnostic efficacy, among which the ADC value and dMPD were the most commonly used and the most widely available.

The ADC values showed distinct manifestations in MFP and AIP, which might be related to their respective pathological characteristics. Notably, although MFP and AIP are both special forms of CP that exhibit “mass-like” appearances, they are different entities with distinctive pathological features [50]. The main pathological changes in MFP include varying degrees of fibrosis and irreversible glandular atrophy, which may sometimes be accompanied by dilatation of the main pancreatic duct [51]. Conversely, AIP is histologically characterized by extensive periductal infiltration of numerous lymphocytes and IgG4-positive plasma cells, along with features such as obliterative phlebitis, storiform fibrosis, and inflammatory infiltrates [50, 52,53,54]. Because of the highly organized microstructure and high cellularity, the diffusion of water molecules may be relatively more restricted in AIP. Thus, MFP and AIP may produce different results for differential diagnosis using DWI- and IVIM-derived parameters. Multiple studies have reported that the ADC values of MFP were significantly higher than those of PC [16, 44, 45, 55,56,57]. Meanwhile, we noticed that among the four studies that explicitly included AIP in the MFP group [14,15,16,17], the ADC values in the MFP group tended to decrease as the proportion of AIP in the MFP group increased. With an increase in the proportion of AIP, the ADC values in MFP changed from significantly higher than those in PC, to not significantly different, and then to significantly lower. Moreover, the meta-analytic pooled sensitivity and specificity of higher ADC values for the diagnosis of MFP were notably robust. Fortunately, the clinical features of MFP and AIP differ [10, 58]. Therefore, when using ADC values to perform the differential diagnosis of IPM and PC, it is important to first determine the subtype of IPM according to the clinical characteristics.

Both the number and the magnitude of the b values influence the calculation of ADC values. For images obtained with low b values, the image contrast mainly depends on the T2 signal, and the signal-to-noise ratio (SNR) and contrast-to-noise ratio (CNR) are not reduced. As b values increase, the detection of restricted diffusion becomes more sensitive [11]. Additionally, SNR is influenced by the field strength. One study also revealed intervendor differences when different field strengths (1.5T or 3T) were applied [59]. Because tumors are heterogeneous, different ROI delineation methods may also affect ADC calculation. Owing to variation in the above-mentioned MRI parameters, ADC values lack consistency across the included studies. Future studies are needed to establish a standardized DWI acquisition method.
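The dependence of the ADC on the chosen b values can be illustrated with a small simulation. The sketch assumes a simplified bi-exponential IVIM signal and a two-point mono-exponential ADC estimate; the tissue parameters are hypothetical values chosen only to make the effect visible and are not taken from any included study.

```python
import math


def ivim_signal(b, f, d_fast, d_slow, s0=1.0):
    """Simplified bi-exponential IVIM signal at b value `b` (s/mm^2)."""
    return s0 * (f * math.exp(-b * d_fast) + (1 - f) * math.exp(-b * d_slow))


def adc_two_point(b_low, b_high, s_low, s_high):
    """Mono-exponential ADC (mm^2/s) from signals at two b values."""
    return math.log(s_low / s_high) / (b_high - b_low)


# Hypothetical tissue parameters chosen only for illustration.
f, d_fast, d_slow = 0.15, 20e-3, 1.2e-3

for b_high in (200, 500, 800):
    adc = adc_two_point(0, b_high,
                        ivim_signal(0, f, d_fast, d_slow),
                        ivim_signal(b_high, f, d_fast, d_slow))
    print(f"b = 0 and {b_high} s/mm2: ADC = {adc * 1e3:.2f} x 10^-3 mm2/s")
```

Under these assumptions, the ADC estimated with a low upper b value is inflated by the perfusion component and approaches Dslow as the upper b value increases, so ADC cutoffs from studies using different b values are not directly comparable.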

For IVIM-derived parameters, all included studies on f showed that the values of f were higher in IPM than in PC. However, for the other parameters, studies with different IPM types reached opposite conclusions. Kang et al used the bi-exponential model for the calculation of Dslow and Dfast, and reported that the median values of Dslow and Dfast in MFP were higher than those in PC, although there was no significant difference in Dslow values between the two groups [38]. Kim et al used a mono-exponential model for the calculation and reported that both the median Dslow and Dfast values of AIP were lower than the median values of PC, although the differences were not significant [35]. In summary, considering the overall trend of previous study findings and the different pathological features of the two diseases, we suggest that MFP and AIP should not be conflated when distinguishing IPM and PC with Dslow and Dfast. However, implementing IVIM analysis remains challenging because it is more complex and time-consuming than DWI, the results vary considerably depending on the model applied, and strong evidence confirming its clinical value is lacking.

When applying the dMPD to distinguish between IPM and PC, it is necessary to first determine the subtype of the IPM. Differentiating between MFP and PC is likely to be difficult because the values often overlap. However, the dMPD performed well in the differential diagnosis of AIP and PC [60]. With the cutoff value of 5 mm, the dMPD has the highest sensitivity, which indicates that dMPD can help exclude patients with AIP.

For the mass stiffness, all included studies consistently showed that the values of mass stiffness were lower in the IPM than in the PC. MRE of the pancreas is a novel, non-invasive imaging technique with high reliability and reproducibility. However, MRE of the pancreas is a breath-hold sequence, and its clinical application remains limited owing to the long examination time, high cost, and lower spatial and temporal resolution than routine MRI [61, 62]. Moreover, the small number of included studies might weaken the results of the meta-analysis. Consequently, further development of this technique, standardization of image acquisition, and studies with larger sample sizes are needed to better solve these clinical problems.

This meta-analysis had several limitations. First, the cases of IPM in single-center studies were limited, and a large amount of data presented a non-normal distribution; therefore, there were risks of data drift and bias. Second, some studies did not explicitly report the composition of the patient group classified as MFP. Consequently, a senior abdominal radiologist carefully read the full text and compared the clinical and laboratory findings presented in the text with the diagnostic criteria of AIP to determine whether the MFP group included AIP. Third, because the values of these MR imaging biomarkers vary significantly with different MRI scanners, field strengths, b values, or methods for ROI delineation, it remains difficult to obtain pre-specified thresholds for differential diagnosis.

In conclusion, quantitative MR imaging biomarkers performed well in the differential diagnosis of IPM and PC. Compared with PC, the ADC values were higher in MFP but lower in AIP. The ADC can distinguish PC from either MFP or AIP with high diagnostic performance, as long as the two IPM types are not mixed together. dMPD can help exclude AIP, especially at a cutoff value of 5 mm.