Incidental focal pancreatic cystic lesions (FPCL) are present in up to 20% of all abdominal MRI examinations and the incidence increases to 40% in patients older than 70 years of age [1]. The large majority of these incidental cysts are benign side-branch intraductal papillary mucinous neoplasms (IPMN); however, a fraction can be dysplastic or malignant [2–4].

Several management guidelines have been developed for incidental pancreatic cystic lesions [5–7]. These guidelines employ a combination of single length size cutoffs, growth on follow-up exams, presence of symptoms, and imaging characteristics to direct management decisions including frequency of follow-up imaging, invasive testing such as endoscopic ultrasound (EUS) guided fine needle aspiration, and resection. None of these guidelines, however, include standards for the measurement of pancreatic cystic lesions, nor do they define what constitutes growth on follow-up.

An additional challenge lies in the configuration of FPCLs—they are often pleomorphic in shape, clustered, or multiple which can make consistent measurements on serial follow-up exams difficult (Fig. 1). Furthermore, when interpreting MRI with magnetic resonance cholangiopancreatography (MRCP), the radiologist must choose between axial or coronal, T1- or T2-weighted, MRCP or post-contrast sequences to measure the lesion (Fig. 2). Significant variability has been reported when measuring FPCLs between different imaging modalities [8]. These factors and the lack of measurement standards result in a potential for significant inter-observer variability in the measurement of pancreatic cystic lesions.

Fig. 1

A 67-year-old asymptomatic female with clustered cystic lesions in the pancreas. Measured size of this cluster of cysts varies dramatically depending on whether the largest individual cyst is measured (white arrow) or if the cluster is measured as a single lesion (black arrow)

Fig. 2

A 61-year-old male with an incidentally discovered FPCL, illustrating how the measured size varies with image selection and measurement technique. A–C On the axial T2-weighted image (A), the inner wall distance measured 0.8 cm. On the coronal MIP image from a 3D MRCP sequence (B), the inner wall distance measured 1.6 cm. On the coronal T2-weighted image (C), the outer wall distance measured 2.0 cm

Therefore, the purpose of this study was to quantify inter-observer variability in the measurement of FPCLs on MRI and to determine whether this variability could affect clinical management. We then introduced measurement standards and assessed their impact on measurement variability.

Materials and methods

Subjects

In this retrospective, IRB-approved, HIPAA-compliant study, with waiver of informed consent, 144 MRI examinations containing at least one FPCL were selected from a database of over 700 MRIs with FPCLs acquired between 2002 and 2011. These exams were randomly selected from six size groups that bracket management decision points (10, 20, 30 mm) in the Sendai and ACR criteria [5–7]. The size of the cysts was determined independently at the time of inclusion in the larger database and this measurement otherwise did not contribute to the study. Power analysis required 24 exams in each size group (5–10, 11–15, 16–20, 21–25, 26–30 and 31–35 mm). Factors such as lesion morphology, imaging appearance, and patient characteristics were not considered in exam selection. Personally identifiable information (PII), previous measurements, and prior reports were stripped from the exams and they were uploaded onto the hospital picture archiving and communication system (PACS) (GE Medical, Waukesha, WI) with a unique study identifier.
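The stratified selection described above can be sketched as follows. This is only an illustration: the text does not describe how the random draw was implemented, so the function name, the fixed seed, and the synthetic database are all assumptions.

```python
import random

# Size groups (mm) as listed in the text; bounds are treated as inclusive.
SIZE_GROUPS_MM = [(5, 10), (11, 15), (16, 20), (21, 25), (26, 30), (31, 35)]

def select_exams(database, per_group=24, seed=0):
    """Draw per_group exam IDs at random from each size group.

    database is a list of (exam_id, size_mm) pairs (hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    selected = []
    for low, high in SIZE_GROUPS_MM:
        pool = [exam_id for exam_id, size in database if low <= size <= high]
        selected.extend(rng.sample(pool, per_group))
    return selected

# Synthetic stand-in for the database of over 700 exams (sizes 5 to 35 mm).
database = [(i, 5 + i % 31) for i in range(700)]
selected = select_exams(database)
print(len(selected))  # 6 groups x 24 exams
```

Sampling without replacement inside each group keeps the six strata equal in size, which is what yields 144 exams in total.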

Image acquisition

The only criterion for inclusion of an exam in the study was the presence of an FPCL. As such, MRI exams included in the study reflect common clinical practice with exams acquired on multiple magnets from multiple vendors at 1.5 and 3.0 Tesla field strengths in a tertiary care hospital, community hospital, and imported from outside institutions. Protocols were variable depending on the clinical indication, location of study, and date of the study. Use of oral and intravenous contrast and MRCP sequences in the study exams was also variable.

Image analysis

Four reviewers, two attending abdominal imagers specializing in body MRI, each with 5 years of post-fellowship experience (M.S. and K.S.L.), and two abdominal imaging fellows (G.R. and S.J.), measured each FPCL twice: once according to their regular clinical practice and again 12 weeks later, after the introduction of, and training with, measurement standards. Greatest single length measurement was used according to the RECIST criteria for both measurements [9]. The interval between measurement sessions was intended to reduce recall bias. Greatest single length measurement, series number, and image number were recorded on an electronic worksheet that indicated the anatomic location of each cystic lesion in the pancreas (e.g., posterior aspect of the uncinate process) but not the image number or previously measured size. Measurements made before the introduction of standards were not available during the second round of measurements. Each reviewer performed measurements independently of the other reviewers and was blinded to any previous measurements. All exam series and images were available to the reviewers.

Measurement standards

We identified two intermediate objectives to reduce variability between reviewers—choosing the same image for measurement and using the same measurement technique—and the standards were designed to accomplish these goals. Measurement standards (Fig. 3) were introduced in a group training session where examples of the standardized measurement approach were shown and questions could be asked and discussed as a group to ensure identical training for each reviewer.

Fig. 3

Measurement standards for pancreatic cysts on MRI provided to reviewers in a group training after the first round of measurements (MIP = maximal intensity projection)

Statistical analysis

Within-subject standard deviation was used as the principal measure of variability. Variability was also assessed using the intraclass correlation coefficient (ICC) for overall absolute agreement. Subgroups for statistical analysis were based on median measured size. Overall and pairwise kappa (κ) were used to assess agreement on management category. Agreement values were interpreted according to the following scale: 0.01–0.20, slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; and 0.81–1.0, excellent agreement [11]. Lesions where two reviewer measurements differed by ≥10 mm or ≥50% were considered at risk for erroneous reporting of growth.
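The text does not give formulas for these quantities, so the following is a minimal sketch under stated assumptions. It takes the within-subject standard deviation to be the square root of the mean per-lesion variance across readers, and it takes the 50% threshold relative to the smaller of two measurements. The function names and the sample measurements are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def within_subject_sd(readings):
    """Square root of the mean per-lesion sample variance (assumed definition)."""
    return sqrt(mean(variance(lesion) for lesion in readings))

def at_risk_of_false_growth(lesion, abs_mm=10.0, rel=0.5):
    """True if any reader pair differs by >= abs_mm or by >= rel of the smaller value."""
    for a, b in combinations(lesion, 2):
        small, large = min(a, b), max(a, b)
        diff = large - small
        if diff >= abs_mm or diff >= rel * small:
            return True
    return False

# Made-up measurements in mm (rows = lesions, columns = four readers).
readings = [
    [8.0, 9.0, 8.0, 16.0],
    [22.0, 24.0, 23.0, 25.0],
    [31.0, 30.0, 33.0, 32.0],
]
sw = within_subject_sd(readings)
flags = [at_risk_of_false_growth(lesion) for lesion in readings]
print(round(sw, 2), flags)
```

In this toy data only the first lesion is flagged, because one reader measured it at twice the size recorded by two others.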

Z test, χ² test, and McNemar’s test were used for comparisons. The level of statistical significance was set at p < 0.05. Statistical analysis was carried out using Matlab (Mathworks, Natick, MA).

Results

Size measurements

Figure 4 shows the distribution of all 144 measurements by each reviewer before (Fig. 4A) and after (Fig. 4B) the introduction of measurement standards. One reader (Fellow 2, cyan, Fig. 4A) initially showed a systematic bias toward smaller measurements, which disappeared after the introduction of standards (Fellow 2, cyan, Fig. 4B). Clear outliers are present in both datasets and substantially influence the results.

Fig. 4

A, B Distribution of all measurements, color-coded by reviewer, for all 144 FPCLs before (A) and after (B) the implementation of measurement standards. The cysts are sorted along the x-axis by the median of the 4 measurements, which is also shown as a thin black line

The mean within-subject standard deviation (SW), a measure of the variability for each lesion measured, decreased in all subgroups after introduction of measurement standards, improving from 4.0 to 3.3 mm overall (Table 1, p < 0.01). SW increased with the size of the lesion being measured (Fig. 5).

Table 1 Variability for measurements of each FPCL before and after introduction of standards as measured by the within-subject standard deviation (SW) and its standard error
Fig. 5

Variability of FPCLs bracketed by 95% confidence intervals based on lesion size before and after the implementation of measurement standards

The measure of absolute size agreement, the ICC, increased from 0.81 to 0.86 (excellent agreement), though this result did not reach statistical significance (p = 0.20).
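For readers unfamiliar with the ICC, a minimal sketch of one common absolute-agreement form is given here. The text says only "overall absolute agreement", so the specific model (two-way random effects, single measurement) is an assumption, as are the function name and the toy data. The study itself used Matlab; Python is used here purely for illustration.

```python
from statistics import mean

def icc_absolute_agreement(x):
    """Two-way random effects, absolute agreement, single-measurement ICC.

    x is a list of rows, one per lesion, each holding one value per reader.
    """
    n, k = len(x), len(x[0])
    grand = mean(v for row in x for v in row)
    row_means = [mean(row) for row in x]
    col_means = [mean(row[j] for row in x) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between lesions
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between readers
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy data: the second reader measures every lesion 2 mm larger than the first.
toy = [[10, 12], [20, 22], [30, 32]]
icc = icc_absolute_agreement(toy)
print(round(icc, 4))
```

Because this form penalizes systematic offsets between readers, the toy data give a value slightly below 1 even though the two readers rank the lesions identically.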

Management categories

Agreement between reviewers on management category (<10, 10–19, 20–29, ≥30 mm) as measured by κ was 0.59 ± 0.02 (moderate agreement) for measurements before introduction of standards and 0.67 ± 0.02 (substantial agreement) afterwards (p < 0.01). All 4 readers agreed on the size category in 73 cases (51%) initially and in 86 cases (60%) after training (p = 0.06). Of the lesions with at least one measurement greater than or equal to 30 mm, a clinically significant cutoff, there was agreement on management category in 14/38 (37%) cases before training and 20/38 (53%) afterwards (p = 0.08).
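The category assignment and the overall agreement statistic can be sketched as below. Table 2 names Fleiss κ, and the code follows the standard Fleiss formula; the function names and the measurements are hypothetical and do not reproduce study data.

```python
from bisect import bisect_right
from collections import Counter

CUTOFFS_MM = (10, 20, 30)  # category edges from the text: <10, 10-19, 20-29, >=30 mm

def size_category(size_mm):
    """Return 0, 1, 2 or 3 for the four size-based management categories."""
    return bisect_right(CUTOFFS_MM, size_mm)

def fleiss_kappa(ratings):
    """Fleiss kappa for rows of category labels, one label per rater."""
    n_subjects, n_raters = len(ratings), len(ratings[0])
    totals = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        agree = sum(c * c for c in counts.values()) - n_raters
        p_bar += agree / (n_raters * (n_raters - 1))
    p_bar /= n_subjects                       # mean observed agreement
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)          # chance-corrected agreement

# Made-up measurements in mm (rows = lesions, columns = four readers).
measurements = [
    [8, 9, 8, 12],
    [22, 24, 23, 25],
    [31, 30, 33, 28],
    [15, 14, 16, 15],
]
categories = [[size_category(m) for m in row] for row in measurements]
kappa = fleiss_kappa(categories)
all_agree = sum(len(set(row)) == 1 for row in categories)
print(round(kappa, 3), all_agree)
```

Two of the four toy lesions straddle a cutoff (10 mm and 30 mm), which is exactly the situation in which small measurement differences change the management category.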

Pairwise comparisons are shown in Table 2. Initially, agreement on size category was highest between the attending readers (κ = 0.69) and lowest between the two fellows (κ = 0.46). Following introduction of standards, there was no difference in agreement between more experienced (κ = 0.67) and less experienced readers (κ = 0.70).

Table 2 Overall and pairwise Fleiss κ comparisons between all reader pairs

Image selection

Table 3 shows the effectiveness of standards in encouraging readers to take measurements from the same sequences. All readers took measurements from the same series in 26/144 (18%) of cases initially and 105/144 (73%) after introduction of standards (p < 0.01). Table 4 shows pairwise comparison between readers choosing almost the same image, meaning an image on the same series within one slice of the other. Paired reviewers chose the same image (±1 slice) in an average of 57 out of 144 or 40% of cases before, and 114 out of 144 or 79% of cases after standards implementation (p < 0.01).

Table 3 Number of different MR image series used by reviewers for the 144 measurements before and after introduction of measurement standards
Table 4 Frequency of readers using the same image ±1 slice measuring 144 pancreatic cysts
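The "same image ±1 slice" criterion used for Table 4 can be expressed compactly. The sketch below assumes each reader's choice is stored as a (series number, image number) pair; the reader labels and selections are hypothetical.

```python
from itertools import combinations

def same_image(sel_a, sel_b, tolerance=1):
    """True if both selections are on the same series and within tolerance slices."""
    (series_a, image_a), (series_b, image_b) = sel_a, sel_b
    return series_a == series_b and abs(image_a - image_b) <= tolerance

def pairwise_same_image_rate(selections_by_reader):
    """Fraction of lesions on which each reader pair chose almost the same image."""
    rates = {}
    for a, b in combinations(sorted(selections_by_reader), 2):
        pairs = zip(selections_by_reader[a], selections_by_reader[b])
        hits = [same_image(x, y) for x, y in pairs]
        rates[(a, b)] = sum(hits) / len(hits)
    return rates

# Made-up (series, image) choices for three lesions by three readers.
selections = {
    "R1": [(5, 20), (7, 11), (3, 40)],
    "R2": [(5, 21), (7, 14), (3, 40)],
    "R3": [(6, 20), (7, 12), (3, 39)],
}
rates = pairwise_same_image_rate(selections)
print(rates)
```

Requiring the same series first matters because slice numbers are not comparable across sequences or imaging planes.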

Exams were not selected based on location of the FPCL within the pancreas; however, Table 1 shows variability was greater in the measurements of lesions in the head, neck, and uncinate process of the pancreas (SW = 4.42 ± 0.20) than in the body and tail (SW = 3.26 ± 0.17, p < 0.01). This difference persisted after introduction of measurement standards (SW = 3.75 ± 0.17 and 2.41 ± 0.13, p < 0.01).

An increase in lesion size of ≥10 mm or ≥50% on follow-up has been used to define growth of an FPCL [10] and could trigger a change in management. Initially, in 62 of the 144 lesions (43.1%), at least two of the four readers had measurement discrepancies which met these criteria. After introduction of standards, only 30 of the 144 (20.8%) measured lesions met these criteria for change, a 51.6% reduction (p < 0.01).
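As a quick check of the arithmetic, the proportions and the relative reduction follow directly from the two counts reported in the text:

```latex
\[
\frac{62}{144} \approx 43.1\%, \qquad
\frac{30}{144} \approx 20.8\%, \qquad
\frac{62 - 30}{62} = \frac{32}{62} \approx 51.6\%
\]
```

The 51.6% figure is therefore a relative reduction in the number of flagged lesions, not a difference in percentage points (which would be about 22 points).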

Discussion

Quality, performance management, and process improvement have become a focus of radiology departments and national meetings over the last decade with the goal of improving value, consistency, and safety. One key area of quality improvement is identifying and eliminating variability in the diagnosis and management of disease and the consistent implementation of practice guidelines.

Significant variability has been shown when measuring FPCLs using differing modalities such as computed tomography (CT), MRI, and EUS [12]. There is also significant variability in the management recommendations made by interpreting radiologists [13, 14]. Inter-observer agreement using MRI in identifying and categorizing concerning features has been found to be especially low [15]. Do et al. also studied inter-observer variability in identification of concerning features and single length measurement specifically of IPMNs with CT [16]. To the best of our knowledge, however, inter-observer variability in FPCL size measurements on MRI has not been previously studied.

Measurement variability

The results of this study indicate that significant inter-observer variability exists in the measurement of focal pancreatic cystic lesions on MRI. Agreement for FPCLs in the first phase of this study (ICC = 0.81) is similar to that reported for CT measurements of IPMNs (ICC = 0.85) [15] and lower than that reported for lung nodules (ICC = 0.97), lymph nodes (ICC = 0.93), and liver tumors (ICC = 0.95) [16].

Hopper et al. reported increased variability in the measurement of irregular tumors in the abdomen which certainly applies to FPCLs [17]. Sources of variability in this study included factors intrinsic to these lesions, such as irregular shape or clustered cysts, location of the lesion in the head of the pancreas, the image selected for measurement, and individual measurement techniques that lead to systematic over- or under-measurement. Maimone et al. found an average of 3–4 mm absolute size difference when comparing measurements acquired from at least 2 different modalities (CT, MRI, EUS) within a 90-day window [8]. They suggested the source of the variability was the different modalities. However, in this single modality study, the reported mean inter-observer within-subject standard deviation (SW) was 3.99 mm, which is similar to the inter-modality variability Maimone et al. reported.

De Jong et al. studied the inter-observer agreement using MRI in identification and categorization of concerning features in pancreatic cysts [15]. Somewhat surprisingly, they found only poor to moderate agreement between four experienced radiologists for identifying concerning characteristics (ICC = 0.23–0.53), though the use of ICC for these categorical comparisons is questionable. Do et al. recently reported similarly limited inter-observer agreement using CT for characterization of IPMN subtype, presence of nodules, and solid components (κ = 0.28–0.46). They found much better agreement on continuous variables such as maximal dimension (ICC = 0.85), pancreatic duct size (ICC = 0.75), and common bile duct size (ICC = 0.61) [15]. Of note, the average size of the IPMNs in that dataset was 35 mm (11–83 mm), while the FPCLs in this study averaged 20 mm (4–54 mm). While agreement on absolute size measurements in our study (ICC = 0.81 and 0.86) was in a range similar to that reported by Do et al., our moderate to substantial agreement on assignment to management categories (κ = 0.59 and 0.67), along with the results by Do et al. on feature analysis, points to the difficulty in consistent MRI evaluation of these lesions by different radiologists.

Analysis of the largest measurement discrepancies in our dataset showed that readers sometimes measured a different lesion than the one intended and measured by others. While specific anatomic descriptors were provided (e.g., pancreatic body, superior to the duct), the presence of multiple cystic lesions in the pancreas, or even within a single segment, can make consistent identification and repeated measurement of the same lesion difficult. A second source of large measurement discrepancies was clustered or septated cysts. Each reviewer had to decide whether to measure the lesion as a single unit or measure the largest component.

Location of the lesions within the pancreas also contributed to variability. Measurements of lesions within the head, neck, and uncinate process of the pancreas demonstrated significantly more inter-observer variability than those in the body and tail. This difference may be attributable to the change in orientation of the pancreas (horizontal in the body/tail vs. more vertical in the head) or increased complexity of the surrounding anatomy in the region of the pancreatic head and uncinate process.

Considering the effect of experience on inter-observer variability, on initial measurements, there was less inter-observer variability between the two most experienced readers, κ = 0.69, than between the two least experienced readers, κ = 0.46. This suggests that experience does contribute to variability in cyst measurement.

Contrary to results from McErlean et al. [17] as well as the 2010 ACR Consensus Statement [7], absolute variability increased with increasing lesion size as measured by the within-subject standard deviation. This difference may be partly attributable to the difficulties involved with measuring clustered cysts.

Clinical implications

Our data show that there is significant inter-observer variability in the measurement of FPCLs and that this variability could have a substantial impact on clinical management. All four reviewers agreed on the management size category for a given lesion only 51% of the time prior to standards training compared to 60% after training. Furthermore, the agreement between reviewers was worse for those lesions which approached the clinically important cutoff of 30 mm where guidelines recommend invasive testing [5–7].

Growth on follow-up scans is a critical image-based factor guiding management of FPCLs. The ACR statement explains that no consensus exists for what defines interval growth [7] and the revised Sendai guidelines only state follow-up interval can be lengthened if there is no change after 2 years [6]. Das et al. used a cutoff of 10 mm or 50% single length measurement increase from initial measurement to define growth [11]. In our study, measurement differences of ≥10 mm or 50% between any two of the four reviewers occurred in 62/144 exams (43.1%) before the introduction of standards.

Effects of standards

The use of measurement standards was successful in accomplishing the intermediate goal of homogenizing image selection. The frequency of all 4 readers choosing the same series for measurement increased from 18% to 73% and reader pairs choosing the same image ±1 slice increased from 39.5% to 79.1% after measurement standards training. Laxity in the measurement standards, which allowed reviewers to use an alternate plane if they found the coronal plane difficult to use, likely contributed to the remaining discrepancy in image selection. Measurement of a nearby non-target lesion occurred regardless of the implementation of measurement standards, also contributing to discrepancies in image selection.

Measurement standards were also successful in reducing erroneous reporting of lesion change from 62/144 (43.1%) cases to 30/144 (20.8%) cases. While this rate remains high, it does indicate that measurement standards can be effective in reducing unwarranted reporting of lesion change.

The use of measurement standards reduced inter-observer variation in assignment of lesions to size-based management groups as represented by the increase in κ (p < 0.01). The improvement in agreement on absolute size measurement (ICC) did not reach statistical significance (p = 0.20).

The difference in κ observed between the pairing of the two staff reviewers and the pairing of the two fellow reviewers disappeared, with the fellow pair increasing their agreement from κ = 0.46 to κ = 0.70, which was the second highest rate of agreement among the six reader pairings. The remaining inter-observer variability (ICC = 0.86) after implementation of measurement standards for FPCLs on MRI, as compared to the high agreement McErlean et al. [17] reported for CT measurements of liver masses and lymph nodes (ICC = 0.93–0.95), may be due to several factors including the pleomorphic nature of FPCLs, incomplete acceptance of measurement standards by the readers, erroneous measurement of a different, non-target lesion, or clustered cystic lesions.

One of the limitations of this study is that unlike clinical practice, reviewers had no access to comparison images or prior reports. Frequently, radiologists use image references from prior reports to ensure similar measurements or simply re-measure the lesion on the comparison exam themselves. As such, the variability we report may be exaggerated compared to results in actual practice. While not the focus of this study, this limitation does highlight the utility of including an image reference in reports or saving measurement markings when interpreting studies containing FPCLs.

Another limitation of the study is the shared background of the reviewers. The two attending reviewers trained at the same institution for both residency and fellowship. This common background could have led to a more standardized approach which could have artificially reduced variability compared to reviewers with more varied training backgrounds. Also, the 4 reviewers in this study may not be representative of radiologists elsewhere in their initial measurement approach or acceptance of standards.

The exams in the study came from many different institutions over a long period with different techniques and image quality, which may affect measurement variability, but this reflects our practice of reviewing outside images and old comparison studies.

Conclusion

This study shows that there is significant inter-observer variability in the measurement of pancreatic cystic lesions, which can affect clinical management. Larger lesion size, location of the lesion in the pancreatic neck, head, or uncinate process, and less reader experience all increased measurement variability in this study.

The measurement standards introduced in this study were effective in homogenizing image selection, reducing unwarranted reporting of change, reducing variability attributable to reader experience, and improving agreement on management categories.

Based on these findings, we recommend that future management guidelines for pancreatic cystic lesions include standards for measurement and define parameters that constitute growth. The measurement standards used in this study were effective and could be used until further research validates these standards or identifies more effective ones.