Introduction

Head and neck cancer is the seventh most common cancer worldwide [1]. The vast majority are squamous cell carcinomas (SCC) of the pharynx, larynx, or oral cavity, which are commonly associated with risk factors such as smoking, alcohol, and age [2]. Cervical lymph node metastases are frequent and have adverse prognostic significance [3, 4].

Extranodal extension (ENE) is the growth of a lymph node metastasis beyond the lymph node capsule and is of critical importance for the correct management of patients with head and neck SCC (HNSCC) [5,6,7,8,9,10]. Several studies have shown significantly worse survival and higher relapse rates in HNSCC patients with ENE, leading to the inclusion of ENE in the latest TNM staging system in 2017 [11,12,13]. Consequently, ENE is currently considered an indication for adjuvant chemoradiotherapy and a contraindication to unimodal surgical treatment [14,15,16].

Despite its high clinical importance, the pathological findings that qualify for a diagnosis of ENE appear to be uncertain [17, 18]. Accordingly, the reported incidence of ENE in HNSCC lymph node metastases varies between 20 and 85% [19,20,21]. Possible reasons are insufficient imaging technologies and variability in histopathological conclusions among pathologists [22]. A recent systematic review of the literature identified 44 unique definitions of ENE, highlighting the lack of a succinct and generally accepted definition [23]. This raises concerns about the reliability of the histopathological diagnosis of ENE and the subsequent treatment planning.

The aims of this study were (1) to determine the interrater and intrarater reliability and agreement in the histopathological assessment of ENE among Danish pathologists and (2) to test whether introducing a standardized assessment method increases interrater agreement.

Materials and Methods

This study was performed as a prospective reliability and agreement study in accordance with the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) [24]. Legal approval from the Research Board of Odense University Hospital (OUH) was obtained before the start of the study.

Sample Size

The sample size estimation was based on assumptions about the proportion of agreement between each pair of pathologists. A minimum proportion of agreement of 80% (the null hypothesis) with a precision of 0.1 on each side was assumed. With a prevalence of ENE of around 63%, the number of histological slides required for the study was calculated to be 123 at a significance level of 5% and a power of 80%. Calculations were made according to the methods described by Hong et al. [25].

Study Sample

One hundred twenty-three histological slides were identified through a combined search in the electronic patient record system at the Department of ORL—Head & Neck Surgery and Audiology, OUH, Denmark, and the local database at the Department of Pathology, OUH. We searched for patients from the period 1 November 2014 to 31 October 2019 classified as having oropharyngeal squamous cell carcinoma (OSCC) with the codes DC102, DC103, DC108, DC109 or DC090-DC099 according to the International Classification of Diseases version 10 (ICD-10). Subsequently, we searched the pathology database for the same period to identify lymph nodes from the neck with squamous cell carcinoma coded T081* or T082* and M807* or M8083*, excluding P31060 (fine needle aspirations), according to the Systematized Nomenclature of Medicine (SNOMED). Finally, we matched the results of the two searches to identify patients with OSCC who had undergone surgical removal of metastatic lymph nodes. From these patients, we identified all slides with lymph node metastases and extracted a consecutive list of the 123 most recent slides. Slides with no evidence of lymph node metastasis were excluded.

Demographic and health data on the included patients were collected from electronic medical records. The variables collected were birthdate, gender, diagnosis, p16 status, and TNM classification.

All lymph nodes for evaluation had been mounted as histological slides and stained with standard hematoxylin and eosin. The slides were then digitally scanned at ×40 magnification using a Hamamatsu NanoZoomer S360. For this study, the samples were retrieved as digital images from the local database. All images were anonymized before they were saved on an encrypted external hard drive, which was physically sent by mail to the pathologists in the assessment team.

Human Papilloma Virus (HPV) association was determined by immunohistochemical staining for p16 as part of the diagnostic work-up.

Pathologic Evaluation

Four specialized head and neck pathologists from the three largest head and neck cancer centers in Denmark were invited to participate in the study: one from the University Hospital of Copenhagen (NCW), one from Aarhus University Hospital (BPU), and two from our own center at OUH (SRL and TMG).

The assessment of the slides was done using NanoZoomer Digital Pathology View v. 2.7.52 (© Hamamatsu Photonics K.K.). Each digitalized slide was evaluated twice by all pathologists through round 1 and round 2. Additionally, two of the pathologists evaluated the slides in a third round for intrarater reliability and agreement. To reduce recall bias, repeated evaluations of the histological slides were separated by a minimum interval of three months. Also, the sequence of the slides was randomized for every round. After each round the pathologists handed in the external hard drive and did not receive it again before the beginning of the next round. The pathologists were blinded to clinical information, previous evaluations, and the ratings of the other participating pathologists. However, they were informed about the title and the purpose of the study, thus knowing they were rating cervical lymph node metastases from patients with OSCC in order to estimate reliability and agreement parameters. Ratings were conducted independently, and the pathologists were specifically instructed not to communicate with each other about assessment results during the study.

In the first round, the pathologists reviewed each slide to determine the presence or absence of ENE based on immediate impression and clinical routine. Additionally, they were asked to grade each slide from 0 to 4 according to the classification system of ENE presented by Lewis et al. [26].

In the second round, the pathologists reviewed each slide again for assessment of ENE and grading. However, this time they were instructed to follow our proposed uniform definition of ENE when determining the presence and grade of ENE: “Squamous cell carcinoma within the confinement of a metastatic lymph node that grows through the lymph node capsule or beyond the lymph node contour into the adjacent tissue regardless of the size of the extension. In case of doubt, the presence of a desmoplastic response confirms the suspicion of ENE” [23]. In addition, they were introduced to the latest guidelines from the International Collaboration on Cancer Reporting (ICCR) and asked to refer to these in case of doubt [27].

In the third round, two pathologists repeated the exact same procedure with the same instructions as in the second round.

Statistics

To examine the reliability and absolute agreement within and between raters, kappa coefficients and proportions of agreement were calculated [24]. Cohen’s kappa coefficient was calculated for interrater reliability between each pair of raters in each round and for intrarater reliability between rounds 2 and 3 for each pathologist. For the overall reliability among all pathologists, Fleiss’ multi-rater kappa coefficient was calculated for each round. The proportion of agreement was reported as the number of observed agreements between pathologists divided by the total number of evaluated slides. In addition, weighted kappa coefficients were calculated in the analyses of ENE grading reliability between two raters. We defined our own weighting matrix, giving full credit to absolute agreement, 0.75 credit to disagreement between grades 0 and 1 and between grades 2 and 3, half credit to disagreement between grades 2–3 and grade 4, and no credit to disagreement between grades 0–1 and grades 2–4. Exploratory hypothesis testing of bivariate associations between categorical variables was done with Fisher’s exact test, and the significance level was 5% (two-sided). All analyses were done using Stata/BE v17.0 (StataCorp LLC, College Station, Texas, USA). Graphics were produced with Microsoft Excel 2016.
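The weighting scheme described above can be illustrated with a minimal Python sketch. It is not the Stata code used in the study; it assumes the standard definition of weighted kappa (weighted observed agreement corrected for the weighted agreement expected from each rater's marginal grade distribution). The function name, the `IDENTITY` matrix, and the example grades are hypothetical and do not come from the study data.

```python
# Agreement weights for ENE grades 0-4, following the scheme described above:
# 1.0 for identical grades, 0.75 for grade 0 vs 1 and grade 2 vs 3,
# 0.5 for grade 2 or 3 vs grade 4, and 0 for any grade 0-1 vs grade 2-4 pair.
WEIGHTS = [
    [1.00, 0.75, 0.00, 0.00, 0.00],
    [0.75, 1.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 1.00, 0.75, 0.50],
    [0.00, 0.00, 0.75, 1.00, 0.50],
    [0.00, 0.00, 0.50, 0.50, 1.00],
]

# Identity weights reduce the calculation to unweighted Cohen's kappa.
IDENTITY = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]


def weighted_kappa(rater_a, rater_b, weights):
    """Return (weighted observed agreement, weighted kappa) for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = len(weights)
    # Marginal proportion of each grade for each rater.
    p_a = [sum(1 for g in rater_a if g == i) / n for i in range(k)]
    p_b = [sum(1 for g in rater_b if g == i) / n for i in range(k)]
    # Observed agreement: mean weight over the paired ratings.
    p_obs = sum(weights[a][b] for a, b in zip(rater_a, rater_b)) / n
    # Chance-expected agreement from the product of the marginals.
    p_exp = sum(
        weights[i][j] * p_a[i] * p_b[j] for i in range(k) for j in range(k)
    )
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return p_obs, (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical grades for eight slides (illustration only, not study data).
grades_a = [0, 1, 2, 3, 4, 0, 2, 4]
grades_b = [0, 0, 3, 3, 4, 1, 4, 2]

print(weighted_kappa(grades_a, grades_b, IDENTITY))  # unweighted Cohen's kappa
print(weighted_kappa(grades_a, grades_b, WEIGHTS))   # custom-weighted kappa
```

Passing the identity matrix shows how the same routine covers the dichotomous ENE assessment, while the custom matrix gives partial credit to disagreements that stay on the same side of the ENE-absent (grades 0–1) versus ENE-present (grades 2–4) boundary.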

Results

Study Sample

Of the original 123 slides, one was excluded due to no evidence of carcinoma in the lymph node, and two others were excluded because they appeared to originate from mucosal biopsies rather than lymph nodes. In total, 120 histological slides of separate cervical lymph nodes with metastatic squamous cell carcinoma from 54 OSCC patients were included. The median number of slides per patient was two (range 1–8). The mean age was 62.2 years, and more than two-thirds of the patients were men. Two of the patients were originally diagnosed with cancer of unknown primary tumor (CUP) but relapsed at their oropharyngeal T-site half a year and three years later, respectively. The majority of cancers were p16+, and most of the patients had low T-, N-, and M-stages. See Table 1 for detailed patient characteristics.

Table 1 Patient characteristics

Raters

Four pathologists from three different hospitals across Denmark rated the same 120 digitalized histological slides. The two pathologists from the same hospital (Odense University Hospital) had three replicate observations each (rounds 1 to 3), whereas the other two pathologists evaluated the slides only in the first two rounds. One of the pathologists did not report data for one of the slides in the first round and therefore had only 119 observations in that round. All pathologists were consultants with a median of 8 years of head and neck pathology experience (range 2–15 years).

Interrater Reliability and Agreement

In the first round, the overall proportion of observed agreement for the histopathological diagnosis of ENE between the four pathologists was 0.66, corresponding to a kappa coefficient of 0.61 (95% CI 0.51–0.71). In the second round, the proportion of agreement increased to 0.72, with a kappa coefficient of 0.68 (95% CI 0.58–0.77). The number of observed lymph nodes with ENE ranged between 45 (38%) and 55 (46%) in the first round and between 36 (30%) and 51 (43%) in the second round. Table 2 states the total count of observed ENE for each pathologist and round. The kappa coefficients between two pathologists ranged from 0.57 (95% CI 0.42–0.72, between pathologists 1 and 4) to 0.66 (95% CI 0.53–0.80, between pathologists 1 and 3) in the first round and from 0.59 (95% CI 0.45–0.74, between pathologists 3 and 4) to 0.78 (95% CI 0.66–0.89, between pathologists 2 and 4) in the second round. The overall difference between rounds 1 and 2 in kappa coefficients and proportions of agreement was 0.066 and 5.30 percentage points, respectively. See Fig. 1 for the boxplot of kappa coefficients in the two rounds and Supplementary Table 1 for a detailed overview of reliability estimates and agreement measures for the histopathological diagnosis of ENE.
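For readers unfamiliar with the multi-rater statistic behind the overall kappa coefficients reported above, the following is a minimal Python sketch of Fleiss' kappa. It assumes that every slide is rated by the same number of raters, so a slide with a missing rating (as occurred for one slide in the first round) would have to be handled separately; how the study handled it is not stated here. The function names and the example ratings are hypothetical and are not study data.

```python
def to_counts(ratings_per_slide, n_categories):
    """Convert per-slide lists of category labels into per-slide category counts."""
    return [
        [sum(1 for r in ratings if r == cat) for cat in range(n_categories)]
        for ratings in ratings_per_slide
    ]


def fleiss_kappa(counts):
    """Fleiss' kappa, where counts[i][j] is the number of raters who
    assigned slide i to category j."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each slide needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Per-slide agreement: fraction of rater pairs that agree on the slide.
    p_slide = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_slide) / n_subjects
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical ENE calls (1 = present, 0 = absent) by four raters on six slides.
example = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
]
print(fleiss_kappa(to_counts(example, 2)))
```

Note that the observed agreement inside Fleiss' kappa is the mean proportion of agreeing rater pairs per slide, which is not necessarily the same quantity as a proportion of slides on which all raters agree.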

Table 2 Total count of ENE
Fig. 1 Boxplot of interrater reliability measures for presence/absence of ENE between every two pathologists in rounds one and two. The plotted kappa coefficients are between pathologists 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, and 3 and 4. The rectangles represent the second and third quartiles, the horizontal line inside represents the median, and the horizontal lines outside represent the maximum and minimum

The overall proportion of observed agreement for ENE grading between the four pathologists was 0.24 in the first round, corresponding to a kappa coefficient of 0.35 (95% CI 0.32–0.39). In the second round, after the introduction of our proposed definition of ENE, the proportion of agreement increased to 0.30, with a kappa coefficient of 0.41 (95% CI 0.34–0.44). The distribution of grades for each pathologist is illustrated in Fig. 2. The kappa coefficients between two pathologists ranged from 0.28 (95% CI 0.21–0.29, between pathologists 2 and 3) to 0.42 (95% CI 0.38–0.47, between pathologists 1 and 2) in the first round and from 0.31 (95% CI 0.28–0.36, between pathologists 1 and 3) to 0.51 (95% CI 0.45–0.60, between pathologists 3 and 4) in the second round. The overall difference between rounds 1 and 2 in kappa coefficients and proportions of agreement was 0.060 and 5.6 percentage points, respectively. The weighted kappa coefficients ranged from 0.46 (95% CI 0.38–0.56, between pathologists 1 and 4) to 0.57 (95% CI 0.50–0.59, between pathologists 2 and 4) in the first round and from 0.53 (95% CI 0.46–0.59, between pathologists 1 and 3) to 0.66 (95% CI 0.58–0.75, between pathologists 2 and 4) in the second round. See Supplementary Tables 2 and 3 for all reliability estimates and agreement measures for ENE grading.

Fig. 2 Bar chart with the count of observed histological slides graded from 0 to 4 according to Lewis’ ENE classification system

Interrater Agreement and p16 Status

Eighty-nine histological slides from 38 patients with p16+ cancers were included in the study. Across pathologists, a median of 30 of these slides (range 25–37; 33%) were diagnosed with ENE in the first round and 28 (range 21–31; 31%) in the second round. The proportion of ENE in histological slides with p16- cancer was 61% in both rounds. The proportion of agreement was higher for p16- (81%) than for p16+ (61%) cancers in the first round (p = 0.076). In the second round, however, agreement showed almost no association with p16 status (p = 0.49). See Table 3 for agreement according to p16 status.
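The comparison of agreement proportions between p16 groups relies on Fisher's exact test for a 2×2 table. The following minimal Python sketch shows one common two-sided variant, which sums the hypergeometric probabilities of all tables that are no more likely than the observed one. Whether Stata uses exactly this definition is an assumption here, and the counts in the example are hypothetical; they are not the counts behind Table 3.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of every table that is no more likely than the
    # observed one; the small tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts (illustration only): rows are two groups of slides,
# columns are "raters agreed" and "raters disagreed".
print(fisher_exact_two_sided(8, 2, 1, 5))
```

In practice, an established statistics package would normally be preferred over a hand-written routine; the sketch is only meant to make the logic of the test explicit.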

Table 3 Agreement according to p16 status

Interrater Agreement in Primary Versus Recurrent Cancers

No difference was detected in the proportion of agreement between histological slides from primary and recurrent cancers. The p-values were 0.82 and 0.99 for rounds 1 and 2, respectively.

Intrarater Reliability and Agreement

Two pathologists evaluated the same histological slides in a third round. The proportion of observed agreement for the histopathological diagnosis of ENE between rounds 2 and 3 was 0.88 for pathologist 1 and 0.93 for pathologist 2, corresponding to kappa coefficients of 0.76 (95% CI 0.64–0.88) and 0.84 (95% CI 0.74–0.94), respectively.

The proportion of observed agreement for grading of ENE between round 2 and 3 was 0.72 for pathologist 1 and 0.81 for pathologist 2 with resulting kappa coefficients of 0.61 (95% CI 0.56–0.68) and 0.75 (95% CI 0.65–0.86), respectively.

When the weighting matrix was applied, the agreement proportions increased to 0.84 for pathologist 1 and 0.89 for pathologist 2, with weighted kappa coefficients of 0.71 (95% CI 0.62–0.81) and 0.80 (95% CI 0.75–0.81), respectively.

Discussion

To our knowledge, this study is the most extensive inter- and intrarater study on ENE in HNSCC. It showed a moderate level of reliability and agreement among Danish pathologists in the assessment of histopathological ENE in lymph node metastases from OSCC and only a minor trend towards improvement after the introduction of standardized methods for evaluation. These findings indicate significant inconsistencies in the histopathological diagnosis of ENE and call for methods and definitions that may improve diagnostic certainty.

In advanced and extensive cases, ENE may be detectable by clinical examination or imaging. The diagnostic accuracy of different imaging technologies has been evaluated in several studies and found to be rather low [22]. Thus, neck dissection with histopathological assessment of the extirpated lymph nodes, including evaluation of ENE, is still considered the gold standard in cancer staging.

Few studies regarding diagnostic certainty of ENE have been performed. In 2012, van den Brekel et al. examined the observer variation in ENE diagnosis among 10 pathologists [17]. Forty-one metastatic lymph nodes from 18 HNSCC patients were evaluated in two rounds. They found poor interrater agreement with overall kappa values of 0.42 and 0.49. Kappa coefficients between two pathologists ranged from 0.14 to 0.73 in the first round and from 0.20 to 0.75 in the second round. The study did not include information regarding p16/HPV status. In 2015, Lewis et al. showed an interobserver agreement of 48% with a kappa-value of 0.51 (95% CI 0.36–0.64) among five pathologists rating ENE in node metastases [18]. The study included 50 histological slides from 50 patients with p16+ OSCC. Interestingly, in a second assessment round, they introduced a self-developed grading system from zero to two based on the degree of tumor invasion into the perinodal tissue. Dichotomizing the grades to ± ENE, they achieved a better agreement ratio at 68% corresponding to a kappa value of 0.64 (95% CI 0.47–0.78). The kappa values between two pathologists ranged between 0.36 and 0.72 in the first round and 0.51 and 0.75 in the second round.

Our study included fewer raters but a larger study sample with both p16+ and p16- OSCC. Moreover, the raters were asked both to assess the presence of ENE and to apply the non-validated grading system developed by Lewis et al. in 2011 [26]. The generally higher concordance in our study may reflect the value of a grading system as a supportive tool. Nevertheless, we found lower agreement using the classification system by Lewis et al. than with a simple dichotomous assessment of ± ENE. This result is not surprising, as the classification system allows five different outcomes and thereby carries a higher risk of inconsistency. Instead of dichotomizing the data, we found it more relevant to analyze weighted kappa coefficients, treating grades 0–1 (ENE not present) as similar, but not equal, and likewise grades 2–4 (ENE present). The analysis showed considerably higher levels of concordance, suggesting that much of the discrepancy was caused by disagreements within similar ENE categories.

Interestingly, our results showed a tendency towards a difference in disagreement between p16+ and p16- tumors. The difference was not statistically significant (p = 0.076). However, in the first round, we found an overall disagreement of 39% in the assessment of metastases from p16+ carcinomas vs. 19% for p16-, indicating greater difficulty in evaluating p16+ lymph node metastases. An explanation may be that lymph nodes with metastasis from p16+ OSCC often grow in a ballooning/cystic manner, where the capsule is stretched and thickened (Fig. 3). Some pathologists may consider this to be ENE caused by a breach in the capsule and a fibrotic reaction to tumor growth in the perinodal tissue with the formation of a so-called pseudo-capsule [28]. Other pathologists, however, would say that the metastasis is within the confinement of the lymph node and therefore not indicative of ENE. Although current studies have shown a lack of correlation between ENE in p16+ lymph nodes and prognosis, pathologists should still report ENE in these cases for educational and research purposes [29,30,31]. Our study supports this, since disagreement on the histopathological diagnosis could potentially explain the lack of association with worsened prognosis in this cohort of patients.

Fig. 3 Lymph node with metastasis from p16+ oropharyngeal squamous cell cancer and pseudocapsule. According to our definition, this finding does not qualify as ENE. (hematoxylin–eosin)

A major reason for the variation in the diagnosis of ENE among pathologists may be the lack of consensus on a histopathological definition. We saw a trend towards an increase in agreement and a decrease in the fraction of lymph nodes with ENE after introducing the pathologists to a uniform definition of ENE. The higher reliability was also evident in the intrarater kappa coefficients for the two pathologists who completed a third evaluation round. Based on our results and the current literature, we recommend that pathologists discuss and agree on a consensus definition of ENE, preferably at an international level.

However, we also acknowledge the fact that a written definition cannot stand alone, since pathologists could interpret it differently. This was reflected in the higher spread of kappa coefficients between every two pathologists in the second round compared to the first round (Fig. 1). The more cautious ENE determination in the second round may be due to the provided guideline, which in some cases recommended a more conservative approach.

Limitations

A limitation of the study was that the analysis of reliability and agreement was based on a single histological slide for each outcome. In the daily clinical setting, pathologists would, in cases of doubt, examine additional sections and levels and sometimes even perform further analyses to make their final diagnosis. In suboptimal tissue sections where part of the lymph node capsule is not present, the assessment is hampered, and the pathology report would include a qualitative description stating the diagnostic uncertainty. This was not possible in our study since we demanded a definitive binary yes or no answer.

Another limitation was the use of scanned histology images. At the time of the study, digitalized slides were not used in the daily routine work in all institutions across the country. For that reason, the assessment of the slides did not reflect the daily routine for all the pathologists. However, digitalized slides were easier to distribute among the geographically separated pathologists. In return, we were able to assemble a team of pathologists with many years of head and neck experience. Moreover, the risk of the pathologists aligning their results was lower; we were able to randomize the order of histological slides for each round; and we were in better control of the minimum interval of time between rounds, minimizing recall bias even more.

Intrarater reliability was calculated based on only two of the pathologists, since the other two pathologists did not participate in the third round. This was a compromise between logistics and the importance of calculating an intrarater kappa coefficient for every pathologist. We do not believe that including the other pathologists would have changed our results.

In conclusion, ENE diagnosis in lymph node metastases from OSCC is a significant challenge even for highly specialized pathologists with many years of experience. It is important for researchers, clinicians, and patients to be aware of the limitations in the pathologic assessment of ENE shown in this study, since they may have a major impact on the subsequent management of the patients. In the era of de-escalation therapy for p16+ OSCC with ENE, these concerns must be taken into consideration [32,33,34].

Conclusion

The current study showed a moderate level of reliability and agreement among Danish pathologists in the assessment of histopathological ENE in lymph node metastases from OSCC. Intrarater reliability and agreement were generally higher than reliability and agreement between different pathologists. Interrater reliability increased slightly after the introduction of defined guidelines and a new standardized ENE definition. We conclude that an agreed-upon ENE definition is helpful but cannot stand alone in increasing interrater agreement.