Introduction

Gallbladder (GB) cancer is the most common malignant tumor of the biliary tract. Although GB cancer is a relatively rare neoplasm, some countries in Asia, Latin America, and Eastern Europe show a high incidence rate [1, 2]. GB cancer is often diagnosed at an advanced stage because it is asymptomatic or presents with nonspecific symptoms in the early stages, and because the gallbladder wall lacks a submucosal layer. However, the use of ultrasonography and computed tomography (CT) has increased the detection of early-stage GB cancer [3].

Complete resection with a negative margin is the only curative treatment for early-stage GB cancer [4]. The extent of surgical resection is usually determined according to the T-stage. Simple cholecystectomy is sufficient for the treatment of T1a GB cancer, and extended cholecystectomy with lymph node dissection is the standard treatment for T2 GB cancer. However, controversy regarding the optimal surgery for T1b GB cancer remains [4, 5]. The National Comprehensive Cancer Network guidelines and the European Society for Medical Oncology guidelines recommend cholecystectomy with liver resection (wedge resection or resection of segments IVb and V) for T1b-stage GB cancer [6, 7]. However, several studies have shown comparable survival outcomes between simple and extended cholecystectomy in patients with T1b GB cancer [8,9,10,11,12,13]. In institutions and countries that perform simple cholecystectomy as a treatment for T1b GB cancer, precise preoperative differentiation between T1- and T2-stage GB cancers can help surgeons choose an appropriate surgical plan and reduce the risk of unnecessary over-resection or reoperation.

Several studies have investigated the CT criteria and diagnostic accuracy for the T-stage of GB cancer [14,15,16]. However, these studies evaluated diagnostic performance across all T-stages of GB cancer, and none had a sufficiently large study population. In addition, the diagnostic performance reported by these studies is discordant, and only one study evaluated inter-observer variability.

Therefore, the purpose of our study was to evaluate the diagnostic performance and inter-observer variability of differentiating between T1- and T2-stage GB cancers using multi-detector row CT (MDCT).

Materials and methods

This retrospective, single-center study was conducted at a tertiary referral academic center. The Institutional Review Board approved this study and waived the need for written informed consent. The study was conducted in accordance with the Declaration of Helsinki.

Patients

A total of 218 consecutive patients who had undergone surgery for GB cancer between January 2016 and September 2020 were identified through a search of our institution’s surgical database. Among them, 162 patients met the following inclusion criteria: (a) patients with T1 or T2 GB cancers and (b) patients who had undergone MDCT within 6 weeks before surgery. Eleven patients were excluded from the study due to the absence of multiplanar reformation (MPR) images (n = 6) or poor image quality (n = 5). A total of 151 patients (81 women and 70 men; mean age, 66.6 years ± 10.4) were included in the study (Fig. 1).

Fig. 1 Inclusion flowchart

Image acquisition

Various MDCT scanners were used: Lightspeed VCT, Discovery CT750 HD, and Revolution Frontier of GE Healthcare (Milwaukee, WI, USA), Somatom Definition Flash and Somatom Force of Siemens Healthcare (Erlangen, Germany), and Aquilion and Aquilion One of Canon Medical Systems (Otawara, Japan). The CT protocol for the evaluation of biliary disease consisted of three phases (unenhanced, arterial, and portal venous phases). Arterial and portal venous phases were obtained 45 and 70 s, respectively, after the injection of iodinated contrast agent (typically 120 mL of 300 mgI/mL at 4 mL/s). The scanning parameters were as follows: tube voltage, 120 kV; tube current, 120–350 mA; and pitch factor, 0.984–1.375. Automated tube current modulation was used. The slice thickness for axial images was 3 or 3.75 mm, and the slice thickness for coronal and sagittal images was 3 mm. Scan coverage was from the right liver dome to the third portion of the duodenum for the unenhanced and arterial phases and from the xiphoid process to the pubic symphysis for the portal venous phase. Coronal and sagittal MPR images were subsequently generated from the portal venous phase images.

Image review

CT image review was independently performed by five radiologists (two experienced radiologists, with 10 and 11 years of experience in abdominal imaging, and three less experienced radiologists, with 2–6 years of experience in abdominal imaging). Before reviewing the CT images of patients enrolled in this study, previous studies on the staging of gallbladder cancer were shared and reviewed [14, 15]. The reviewers used the criteria suggested by Kim et al. to differentiate between T1 and T2 GB cancers, which are as follows: T1, polypoid lesions without focal thickening of the gallbladder wall, or nodular or flat lesions with mucosal enhancement or focal thickening of the inner enhancing layer of the gallbladder wall with a clear low-attenuated outer wall; T2, nodular or sessile lesions associated with focal thickening of the gallbladder wall at what was considered to be attachment sites, diffuse wall thickening with heterogeneous enhancement, diffuse wall thickening with strong thick inner wall enhancement and weak enhancement of the outer layer, or focal wall thickening with outer surface dimpling at the tumor base [15]. As a training session, CT images and information regarding the T-stage of 40 patients (20 T1 cancers and 20 T2 cancers) who were not included in this study were reviewed by all reviewers.

In the reading session, the reviewers were informed of the purpose of this study. However, the reviewers were blinded to the clinical information, including the staging. The reviewers evaluated the CT images using a picture archiving and communication system (PACS; Centricity Radiology RA 1000; GE Healthcare, Chicago, IL, USA). The reviewers independently evaluated the T-stage of gallbladder cancers using a six-point scale as follows: 1, definitely T1; 2, probably T1; 3, T1 more likely than T2; 4, T2 more likely than T1; 5, probably T2; 6, definitely T2. In separate sessions, two of the five reviewers independently classified the radiologic morphology of the tumors into one of three categories: polypoid, polypoid with focal wall thickening, or wall thickening type (Fig. 2). After the independent review of tumor morphology, discordant results between the two reviewers were resolved by consensus.

Fig. 2 Tumor morphologic type. a Polypoid type. b Polypoid with wall thickening type. c Wall thickening type

Clinical data acquisition

The following information was obtained by reviewing the electronic medical records of our institution: age, sex, carbohydrate antigen 19-9 (CA19-9), type of surgery, and pathologic T-stage.

Statistical analysis

Under the assumption that the proportion of accuracy under the null hypothesis (P0) is 0.85, a sample size of 148 achieved 80% power to detect a difference of -0.10 using a two-sided Z-test with a significance level (α) of 0.050.
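The sample size statement above can be checked with a short calculation. The following is a minimal sketch, assuming a normal-approximation formula in which the standard error is evaluated at the alternative proportion (P1 = 0.75); the text does not state which formula or software option was used, so this choice and the function name `sample_size_one_proportion` are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_one_proportion(p0, delta, alpha=0.05, power=0.80):
    """Approximate n for a two-sided one-sample Z-test of a proportion.

    Assumption (not stated in the text): the standard error is evaluated
    at the alternative proportion p1 = p0 + delta.
    """
    p1 = p0 + delta
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    n = (z_alpha + z_beta) ** 2 * p1 * (1 - p1) / delta ** 2
    return ceil(n)


# Values taken from the text: P0 = 0.85, difference = -0.10, alpha = 0.05, power = 80%
n_required = sample_size_one_proportion(p0=0.85, delta=-0.10)
print(n_required)
```

Under these assumptions the sketch returns 148, which matches the sample size stated above. Other variance conventions (for example, evaluating the standard error at P0) give different values.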

The diagnostic performance of each reviewer using the six-point scale was assessed using receiver operating characteristic (ROC) curve analysis, and the area under the ROC curve (AUC) was calculated. In addition, we calculated the sensitivity, specificity, accuracy, positive predictive value (PPV), and negative predictive value (NPV) based on a binary scale: scores of 1, 2, and 3 on the six-point scale were classified as T1, and scores of 4, 5, and 6 were classified as T2. The T1 stage was defined as positive. To obtain the overall diagnostic performance of all radiologists, staging was determined according to the decision of the majority of the reviewers. Inter-observer agreement for T-staging based on the binary scale was assessed using Fleiss κ statistics. Inter-observer agreement between the two reviewers for the morphologic tumor type was evaluated using κ statistics. Kappa values were interpreted as follows: κ ≤ 0.2, poor; 0.2 < κ ≤ 0.4, fair; 0.4 < κ ≤ 0.6, moderate; 0.6 < κ ≤ 0.8, substantial; 0.8 < κ ≤ 1.0, excellent. The inter-observer agreement for T-staging was compared between radiologists’ experience levels using a two-independent-sample z test on the basis of asymptotic normality. The diagnostic performance (sensitivity, specificity, and accuracy) for T-staging was compared between radiologists’ experience levels using generalized estimating equations (GEE). A two-sided P-value of less than 0.05 was considered statistically significant. Statistical analysis was performed using SAS version 9.4 (SAS Institute Inc., Cary, NC, USA).
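The dichotomization, majority decision, and binary performance measures described above can be sketched as follows. This is an illustrative outline only, not the SAS analysis used in the study; the function names and the eight-patient ratings are hypothetical, and only the 1–3 versus 4–6 cut-off and the T1-as-positive convention come from the text.

```python
def to_stage(score):
    """Dichotomize a six-point confidence score: 1-3 -> 'T1', 4-6 -> 'T2'."""
    if score not in (1, 2, 3, 4, 5, 6):
        raise ValueError(f"score must be an integer from 1 to 6, got {score!r}")
    return "T1" if score <= 3 else "T2"


def majority_stage(scores):
    """Stage assigned by the majority of an odd number of reviewers."""
    calls = [to_stage(s) for s in scores]
    if len(calls) % 2 == 0:
        raise ValueError("an odd number of reviewers is needed to avoid ties")
    return "T1" if calls.count("T1") > len(calls) // 2 else "T2"


def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(predicted, truth, positive="T1"):
    """Binary performance measures with the given stage treated as positive."""
    pairs = list(zip(predicted, truth))
    tp = sum(p == positive and t == positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    tn = sum(p != positive and t != positive for p, t in pairs)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical ratings: one row per patient, one six-point score per reviewer
ratings = [
    [1, 2, 2, 3, 4],
    [2, 3, 4, 5, 3],
    [4, 5, 6, 3, 5],
    [5, 6, 6, 4, 5],
    [3, 2, 4, 1, 2],
    [6, 5, 4, 4, 3],
    [4, 4, 5, 6, 6],
    [1, 1, 2, 2, 1],
]
# Hypothetical pathologic stages for the same patients
pathology = ["T1", "T1", "T1", "T2", "T2", "T2", "T2", "T1"]

consensus = [majority_stage(row) for row in ratings]
metrics = diagnostic_metrics(consensus, pathology)
print(consensus)
print(metrics)
```

Because T1 is the positive class, a false positive here is a T2 cancer read as T1 (understaging), and a false negative is a T1 cancer read as T2 (overstaging).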

Results

The baseline characteristics of the patients are summarized in Table 1. Our study population consisted of 49 (32%) patients with T1 GB cancers and 102 (68%) patients with T2 GB cancers.

Table 1 Demographics of Enrolled Patients (n = 151)

Diagnostic performance

The AUC, sensitivity, specificity, accuracy, PPV, and NPV of each reviewer for T-staging are shown in Table 2 and Fig. 3. The AUC ranged from 0.69 to 0.80. The overall accuracy of the five radiologists was 78%. The sensitivity was higher for experienced reviewers (P < 0.001), while the specificity was higher for less experienced reviewers (P < 0.001).
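For readers unfamiliar with how an AUC is obtained from an ordinal confidence scale, the following minimal sketch computes the empirical AUC as the proportion of T1/T2 patient pairs that are ranked correctly, with ties counted as one half. The estimator is an assumption (the text does not specify how SAS computed the AUC), and the scores below are hypothetical.

```python
def auc_from_ratings(scores_t1, scores_t2):
    """Empirical AUC with T1 as the positive class.

    On the six-point scale, lower scores indicate T1, so a pair is ranked
    correctly when the T1 patient has a lower score than the T2 patient.
    Ties contribute one half.
    """
    if not scores_t1 or not scores_t2:
        raise ValueError("both groups must contain at least one score")
    correct = 0.0
    for s1 in scores_t1:
        for s2 in scores_t2:
            if s1 < s2:
                correct += 1.0
            elif s1 == s2:
                correct += 0.5
    return correct / (len(scores_t1) * len(scores_t2))


# Hypothetical scores from one reviewer, grouped by pathologic stage
t1_scores = [1, 2, 3, 5]
t2_scores = [2, 4, 5, 6]

auc = auc_from_ratings(t1_scores, t2_scores)
print(auc)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of T1 from T2 cancers.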

Table 2 Diagnostic performance in T-staging of T1 and T2 gallbladder cancers
Fig. 3 ROC curve analysis for T-staging

Inter-observer agreement for T-staging

The results of the inter-observer agreement for T-staging are shown in Table 3. For the assessment of T-stage, the inter-observer agreement of all reviewers showed fair reliability (κ = 0.36; 95% confidence interval: 0.31, 0.41). Although inter-observer agreement of the more experienced reviewers was higher than that of the less experienced reviewers, the difference was not statistically significant (κ = 0.54 vs 0.38, P = 0.093).
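As an illustration of the agreement statistic reported here, the sketch below computes Fleiss κ from a table of per-patient category counts and maps the result to the interpretation bands given in the statistical analysis section. It is a plain-Python outline, not the SAS procedure used in the study; the counts are hypothetical and the confidence interval is not reproduced.

```python
def fleiss_kappa(counts):
    """Fleiss kappa; counts[i][j] = number of raters assigning patient i to category j."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every patient must be rated by the same number (>= 2) of raters")
    n_categories = len(counts[0])

    # Overall proportion of ratings falling in each category
    p_cat = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among rater pairs for each patient
    p_subj = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_subj) / n_subjects
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_observed - p_expected) / (1 - p_expected)


def interpret_kappa(kappa):
    """Interpretation bands as defined in the statistical analysis section."""
    if kappa <= 0.2:
        return "poor"
    if kappa <= 0.4:
        return "fair"
    if kappa <= 0.6:
        return "moderate"
    if kappa <= 0.8:
        return "substantial"
    return "excellent"


# Hypothetical counts for six patients rated by five reviewers: [n calling T1, n calling T2]
binary_calls = [[5, 0], [0, 5], [3, 2], [2, 3], [4, 1], [1, 4]]

kappa = fleiss_kappa(binary_calls)
print(round(kappa, 3), interpret_kappa(kappa))
```

With these bands, the reported κ of 0.36 falls in the fair range.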

Table 3 Results of Inter-observer Agreement for T-staging

Analysis according to morphologic tumor type

According to the radiologic morphology, 41 (27.2%), 40 (26.5%), and 70 (46.4%) tumors were classified as polypoid, polypoid with wall thickening, and wall thickening type, respectively. The inter-observer agreement of the two reviewers for the classification of morphologic tumor type was substantial (κ = 0.70).

The overall accuracies for the T-stage were 63%, 78%, and 87% for polypoid, polypoid with wall thickening, and wall thickening type, respectively. In the polypoid type, the sensitivity and accuracy were higher (P < 0.001 and P = 0.002, respectively), and the specificity was lower (P = 0.009), for experienced reviewers (Supplementary Figure S1–S3). In the polypoid with wall thickening type, the sensitivity showed no difference (P = 0.823), but the specificity and accuracy were lower for experienced reviewers (P < 0.001 and P = 0.012, respectively). In the wall thickening type, the sensitivity, specificity, and accuracy were not different between experienced and less experienced reviewers (P = 0.141, P = 0.062, and P = 0.252, respectively). The kappa values for the T-stage of the five reviewers were 0.07 (−0.03, 0.16), 0.25 (0.16, 0.36), and 0.32 (0.25, 0.40) for polypoid, polypoid with wall thickening, and wall thickening type, respectively (Table 3).

Analysis of incorrect T-staging cases and poor inter-observer agreement cases

Seven patients with T1 GB cancers were incorrectly staged as T2 by all five reviewers. Four of these patients had GB cancers of the polypoid with wall thickening type, and three had GB cancers of the wall thickening type (Fig. 4). There were no patients in whom T2 GB cancers were incorrectly staged as T1 by all five reviewers (Fig. 5).

Fig. 4 A case with good inter-observer agreement between reviewers but inaccurate T-staging. The case was a 71-year-old man with GB cancer, with a tumor of the wall thickening type. All reviewers staged the tumor as T2, but the pathologic result was T1a. Axial (a) and coronal (b) images

Fig. 5 Cases of poor inter-observer agreement for T-staging. a A 70-year-old man with polypoid-type T1a GB cancer. Three reviewers staged T1, and two reviewers staged T2. Axial image. b A 64-year-old man with polypoid-type T2 GB cancer. Three reviewers staged T2, and two reviewers staged T1. Axial image. c A 61-year-old man with polypoid with wall thickening-type T1a GB cancer. Three reviewers staged T1, and two reviewers staged T2. Coronal image. d A 50-year-old woman with polypoid with wall thickening-type T2 GB cancer. Three reviewers staged T2, and two reviewers staged T1. Coronal image

There were 43 patients with poor inter-observer agreement, defined as a patient for whom the T-staging results of the five reviewers were split 3:2. Among these 43 patients, 24, 9, and 10 had polypoid, polypoid with wall thickening, and wall thickening type tumors, respectively. Poor inter-observer agreement was seen in 59% (24/41) of patients with the polypoid type, 23% (9/40) of patients with the polypoid with wall thickening type, and 14% (10/70) of patients with the wall thickening type, and the proportion of patients with poor inter-observer agreement differed according to the morphologic type (P < 0.001) (Table 4).
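The comparison of these three proportions can be illustrated with a Pearson chi-square test on the 2 × 3 table of counts given above. The text does not name the test that produced the P-value, so the choice of test is an assumption; the function name is hypothetical. With two degrees of freedom, the chi-square upper-tail probability has the closed form exp(−x/2), which keeps the sketch free of external libraries.

```python
from math import exp


def chi_square_2xk(events, totals):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table.

    events[i] is the number of patients with the outcome in group i and
    totals[i] is the group size.
    """
    n = sum(totals)
    event_rate = sum(events) / n
    statistic = 0.0
    for observed_event, total in zip(events, totals):
        observed_other = total - observed_event
        expected_event = total * event_rate
        expected_other = total * (1 - event_rate)
        statistic += (observed_event - expected_event) ** 2 / expected_event
        statistic += (observed_other - expected_other) ** 2 / expected_other
    return statistic, len(totals) - 1


# Counts from the text: poor agreement in 24/41, 9/40, and 10/70 patients
poor_agreement = [24, 9, 10]
group_sizes = [41, 40, 70]

chi2, dof = chi_square_2xk(poor_agreement, group_sizes)
# Closed-form upper-tail probability, valid only for 2 degrees of freedom
p_value = exp(-chi2 / 2)
print(round(chi2, 2), dof, p_value)
```

Under this assumed test, the resulting P-value is below 0.001, in line with the value reported above.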

Table 4 Diagnostic Performance in T-Staging of T1 and T2 Gallbladder Cancer according to morphologic type

Discussion

Recent studies have demonstrated that simple cholecystectomy has similar recurrence and survival rates to radical cholecystectomy for T1b GB cancer [8]. Based on the results of these studies, it is important to distinguish T1- and T2-stage GB cancers preoperatively. In our study, radiologists differentiating T1 and T2 GB cancers using MDCT achieved a median AUC of 0.77 and an overall accuracy of 78%, and the inter-observer agreement between radiologists was fair (κ = 0.36). Thus, neither the diagnostic performance nor the inter-observer agreement was satisfactory.

Previous studies have reported the diagnostic performance of CT for T-staging. Yoshimitsu et al. reported an overall accuracy for T1-stage GB cancer of 86%, while Kim et al. reported 94% [15, 16]. Compared to the results of previous studies, the overall accuracy for T-staging was low at 78% in our study. Even the most accurate radiologist achieved an accuracy of only 80%. The diagnostic performance for T-staging can be influenced by the experience level of the radiologists and the characteristics of the tumors included. To minimize the effect of these factors on the results, radiologists with various experience levels were included in our study. In addition, we enrolled a sufficient number of patients to minimize the random error that results from a small sample size.

Diagnostic performance was affected by the experience level of radiologists. Experienced radiologists showed slightly higher AUCs for T-staging than less experienced radiologists (0.79 and 0.80 vs. 0.69, 0.77, and 0.73, respectively). However, the accuracy was not different between experienced and less experienced radiologists (74% and 74% vs. 70%, 74%, and 78%, respectively). Interestingly, the sensitivity was higher, but the specificity was lower, for experienced radiologists. In other words, experienced radiologists tended to understage, and less experienced radiologists tended to overstage. In addition, the inter-observer agreement of the experienced reviewers was higher than that of the less experienced reviewers, although the difference was not statistically significant.

In our study, all tumors were classified into three morphologic types: polypoid, polypoid with wall thickening, and wall thickening type. The overall accuracy was lowest in the polypoid type (63%, 78%, and 87% for polypoid, polypoid with wall thickening, and wall thickening type, respectively), as was the inter-observer agreement (κ = 0.07, 0.25, and 0.32 for polypoid, polypoid with wall thickening, and wall thickening type, respectively). For polypoid-type tumors with a broad base, it is difficult to distinguish T1 from T2 because the boundary between the tumor and the gallbladder wall can be ambiguous. This may be one of the reasons for the low accuracy and low inter-observer agreement in polypoid-type tumors.

Our study has several limitations. First, this study was conducted at a single tertiary hospital. Therefore, our results may not be generalizable to other hospitals in different countries. Second, we included only T1- and T2-stage GB cancers. Therefore, in real clinical practice, where gallbladder lesions range from benign lesions to advanced GB cancers, the diagnostic accuracy and inter-observer variability may differ from our results.

In conclusion, the accuracy of MDCT for differentiating T1 and T2 GB cancers is limited, and there is considerable inter-observer variability. The diagnostic accuracy and inter-observer variability were worst for polypoid lesions. There is a need for further research on how to improve the accuracy of preoperative T-staging in early GB cancer.