Introduction

Acute intestinal graft-versus-host disease (GvHD) is a severe, potentially life-threatening complication after allogeneic bone marrow or peripheral hematopoietic stem cell transplantation (HSCT). Central to its pathogenesis are engrafted donor immune cells causing an allo-immunoreaction that leads to epithelial cell apoptosis, inflammation, and tissue injury [17]. The diagnosis of GvHD relies on clinical symptoms and, especially in non-diagnostic or precarious clinical settings, on histopathological findings [15]. However, the histological severity of GvHD, assessed by grading the characteristic histological features, is often only weakly correlated with the clinical course of the disease [1, 16, 21, 23].

Difficulties in the histopathological diagnosis of GvHD may have several causes. The various proposed diagnostic scoring systems mostly consider classical, fully developed features [10, 12, 14, 20]. However, in early cases with subtle features as well as in cases with heterogeneous features, subjective interpretation by individual pathologists may result in interindividual differences in both diagnosis and grading. Discrepancies may be marginal when pathologists from single or cooperating institutions have an opportunity to adjust each other’s diagnostic criteria [5, 20], but when there is no such exchange of experience, they could more profoundly affect inter-institutional reproducibility [19, 24]. Additionally, the differential diagnosis includes infection [2, 3] and toxic side effects of applied therapies [11, 18], which are both frequent in patients after HSCT and could mimic GvHD histologically as well as clinically.

In order to evaluate the interobserver variability of the histological diagnosis of acute intestinal GvHD, we conducted an international round robin test with the original slides used for histopathological diagnosis. Based on the obtained results, we proposed consensus criteria for morphological diagnosis. This consensus includes subtle features of early disease and of gray zones between different histologic grades, as well as classical characteristics of advanced disease, all supported by photographic images.

Materials and methods

Materials for study

The round robin test on acute gastrointestinal GvHD, in which 5 pathologists participated, included 33 biopsies from 23 patients after allogeneic HSCT. Since an initial evaluation of scanned whole slides was found to be insufficient for assessing the delicate histological details needed for diagnosis, e.g., apoptotic bodies, the slides used for the original histological diagnosis were sent around for evaluation. We included 24 colonic, 5 small intestinal, and 4 gastric biopsies (Table 1). One colonic and one esophageal biopsy were also sent but were not included in the presented evaluation, the former because of the short interval of only 8 days after HSCT and the latter because it was the only biopsy from that site. According to current diagnostic standards of the participating institutions, the evaluated slides were stained with hematoxylin-eosin (H&E) and with additional conventional as well as immunohistological stains if deemed appropriate. Each H&E stained slide contained up to 8 tissue sections of a single biopsy. Additional stains included periodic acid-Schiff (PAS) stains for 6 biopsies, Giemsa stains for 4 biopsies, Epstein-Barr virus (EBV)-encoded RNA (EBER) in situ hybridization for 2 biopsies, as well as immunostains (2 biopsies stained for cytomegalovirus (CMV), 2 for herpes simplex virus types 1 and 2, and a single biopsy stained for CD20, CD56, CD138, MUM1, and kappa and lambda light chains). Caspase immunostaining was not provided, since none of the institutions used it for routine GvHD diagnosis.

Table 1 Patient characteristics

The cases selected for review were contributed by institutes of pathology in Switzerland (Basel (A.T.)), Italy (Bolzano (G.N.)), and Germany (Regensburg (E.H.), Mainz (A.K.), and Würzburg (A.M.)). All these institutions are associated with clinics which perform allogeneic HSCT. Cases for this study were not obtained consecutively or randomly; rather, biopsies were specifically chosen to represent a spectrum of cases considered classical, difficult, rare, borderline, or not diagnostic for GvHD. The sample therefore included the full range of histologic grades of acute gastrointestinal GvHD.

Methods for histologic evaluation

The participants were requested to provide a diagnosis (positive or negative for GvHD, or an alternative diagnosis), a Lerner score in case of GvHD and a semiquantitative evaluation of the number of apoptotic bodies, crypt destruction, and mucosa denudation, as outlined in Table 2 [8, 10]. The Lerner score is the most widely used score for the diagnosis of acute GvHD in Germany, Austria, and Switzerland.

Table 2 Lerner classification of acute intestinal GvHD

The accompanying clinical information initially provided on the request form by endoscopists or hematologists was made available to the participants, in order to simulate a standard histopathological diagnostic setting [15]. Table 1 lists the submitting institution, relevant patient data as available on the surgical biopsy request form, the type of tissue(s), and the GvHD histologic Lerner score(s). As we focused on histology, follow-up data or correlation with the clinical course was not included in the present study.

The findings of the initial round robin review were discussed in Basel (Switzerland) in 2012 at a meeting of the German-Austrian-Swiss GvHD Consortium. Differences between participants in assessment of histological criteria of GvHD (i.e., apoptosis, crypt destruction, and mucosa denudation) and their interpretation were identified. The need for consensus on criteria, including those covering advanced as well as subtle morphological changes, was recognized.

A questionnaire containing questions on diagnostic strategies and histological criteria was developed and sent to the 5 pathologists who took part in the initial round robin test and to 7 others from institutions in which HSCT is performed (A.J. from France, M.A. and A.A.K. from Germany, I.M. and A.B. from Austria, and D.C. and H.S. from the USA). Digital photomicrographs of features of GvHD with potential diagnostic significance, taken from gut biopsies of patients after HSCT, were circulated between participants. The results were discussed at a follow-up meeting of the Consortium in Mainz (Germany) in 2013. A written consensus, including diagnostic strategies, diagnostic criteria (Table 3), and representative photomicrographs (Figs. 1, 2, 3, 4, 5, and 6), was approved by all 12 participants. Subsequently, the slides used for the original histological diagnosis were again sent to the five institutions of the first round robin test for reevaluation by applying the consensus criteria. The time lapse between the two evaluations was at least 1 year for each of the scoring pathologists.

Table 3 Consensus on diagnostic criteria of intestinal acute GvHD
Fig. 1

Typical examples of crypt cell apoptosis (HE, original magnification ×400): a, d each two apoptotic bodies (only one would not have been diagnostic), with surrounding halo; b, c, and e more apoptotic bodies, easy to recognize, with a “dusty” appearance

Fig. 2

These features were not regarded as diagnostic for apoptosis (HE, original magnification ×1000): a reveals only a single eosinophilic element of uncertain origin, b possibly a lymphocyte

Fig. 3

Typical examples of crypt destruction (HE, original magnification ×400): a with destruction of more than 1/3 of the circumferential crypt epithelium by confluent apoptosis; b and c with flattened epithelium; and d the connection to normal appearing superficial parts of the crypt

Fig. 4

Crypts not considered to represent crypt destruction (HE, original magnification ×400): crypt epithelium is not flattened; a reveals some inflammatory cells; in b, c, and d, apoptosis is present in a small portion of the epithelial cells; and e shows a small crypt filled with detritus and focal atrophy of the epithelium

Fig. 5

Typical examples of mucosa denudation (HE, original magnification a, c, d ×100, b ×400; b shows a detail of a at higher magnification). These examples show the loss of the surface epithelial layer and of many crypts as well as some inflammatory reaction with cells and fibrin at the surface

Fig. 6

Mucosal changes not accepted as diagnostic for mucosa denudation (HE, original magnification a ×400, b ×1000, c ×250; b shows a section of a). Few inflammatory cells are present without fibrin deposits, and the loss of the epithelium is most likely due to mechanical injury

Interobserver agreement was evaluated as percentage agreement and, for multiple observers, with Fleiss’ generalized kappa. The sample contained a high number of biopsies with the diagnosis “GvHD” compared to “no GvHD” (high marginal heterogeneity), which could result in a high chance-agreement probability and thus a relatively low kappa value [6]. We therefore also computed a reliability coefficient with an adjusted chance agreement, Gwet’s AC1, which has been shown to compensate for the effect of high marginal heterogeneity [22]. The results were computed with AgreeStat 2013.1 software (Advanced Analytics, LLC, Gaithersburg, MD, USA [7]) running on Microsoft Excel 2013 (Microsoft, Redmond, WA, USA).
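As a minimal sketch of how these three measures relate, the following Python code computes the pairwise percentage agreement, Fleiss' generalized kappa, and Gwet's AC1 from raw ratings. It is not the AgreeStat implementation used in the study: the function names and the toy ratings (10 hypothetical cases, 5 raters) are assumptions for illustration, and the formulas are the standard definitions for an equal number of raters per case.

```python
from collections import Counter


def rating_counts(ratings, categories):
    """Convert per-case rater labels into per-case counts per category."""
    return [[Counter(case)[c] for c in categories] for case in ratings]


def observed_agreement(counts):
    """Mean proportion of agreeing rater pairs per case."""
    n = sum(counts[0])  # raters per case (assumed equal for all cases)
    pairs = sum(sum(r * (r - 1) for r in row) for row in counts)
    return pairs / (len(counts) * n * (n - 1))


def category_shares(counts):
    """Overall share of ratings that fall into each category."""
    n = sum(counts[0])
    total = len(counts) * n
    return [sum(row[k] for row in counts) / total for k in range(len(counts[0]))]


def fleiss_kappa(counts):
    """Fleiss' kappa: chance agreement is the sum of squared category shares."""
    pa = observed_agreement(counts)
    pe = sum(p * p for p in category_shares(counts))
    return (pa - pe) / (1 - pe)


def gwet_ac1(counts):
    """Gwet's AC1: chance agreement is sum(p * (1 - p)) / (q - 1)."""
    pa = observed_agreement(counts)
    shares = category_shares(counts)
    pe = sum(p * (1 - p) for p in shares) / (len(shares) - 1)
    return (pa - pe) / (1 - pe)


# Hypothetical toy data, NOT the study data: 10 cases rated by 5 observers.
G, N = "GvHD", "no GvHD"
toy_ratings = [[G] * 5] * 8 + [[G, G, G, G, N], [G, G, G, N, N]]
toy_counts = rating_counts(toy_ratings, [G, N])

print(round(observed_agreement(toy_counts), 3))
print(round(fleiss_kappa(toy_counts), 3))
print(round(gwet_ac1(toy_counts), 3))
```

With this strongly skewed toy sample, the observed agreement is 0.90, while kappa is about 0.11 and AC1 about 0.89. Both coefficients are undefined when their chance-agreement term equals 1, a case the sketch does not handle.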

Results

Histology

The histologic architecture was well preserved in most of the 24 biopsies of the colon. Occasionally, clusters of apoptotic cells were noted in crypt epithelium. The extent of crypt damage varied and included severe damage qualifying as crypt destruction. Some cases had features of crypt loss and/or mucosa denudation. CMV immunohistochemistry was done in one biopsy from the colon but was negative. Of the 5 biopsies of the small intestine, 3 were from the duodenum; in the 2 others, complete mucosal denudation hampered histological identification of their origin. The duodenal biopsies showed well-preserved architecture and a variable number of apoptotic bodies, with occasional crypt destruction or mucosal denudation. In one biopsy, an atypical plasmacytoid infiltrate was suggestive of post-transplantation lymphoproliferative disease (PTLD). This case was also positive for CMV and EBER. The four gastric biopsies showed apoptotic cells and some crypt destruction, up to mucosal denudation in two samples.

Consensus criteria

The authors agreed that for a diagnosis of GvHD or its grading, the most advanced tissue alteration in a given biopsy needed to be considered. The consensus diagnostic criteria for apoptosis included shrinkage of crypt epithelial cells in combination with condensed nuclear chromatin and cytoplasmic eosinophilia (Fig. 1). As an alternative, a minimum of two particles of nuclear debris in one area or signs of phagocytosis of cell debris were accepted. In contrast, close contact of intraepithelial lymphocytes with apoptotic epithelial cells (satellitosis), even though characteristic of apoptosis, was not considered necessary for establishing apoptosis. Apoptotic cells in luminal surface epithelium or in the lamina propria were not considered diagnostic for GvHD, but part of the physiological cell turnover [4]. Isolated cells with condensation of nuclear chromatin, shrinkage, or ballooning were also not considered sufficient for establishing apoptosis, although they do indicate cell damage (Fig. 2).

As a criterion for crypt or gland destruction, destruction of at least one third of the enterocytes comprising the circumference of a cross-sectioned crypt was deemed necessary. Dilation of a crypt accompanied by epithelial flattening and luminal cell debris was considered characteristic of an apoptotic crypt or gland abscess (Fig. 3). Without these signs of damage of the epithelial cells or in the presence of a lesser degree of atrophy of epithelial cells, crypt destruction could not be concluded (Fig. 4). The mere loss of crypts might be a remnant of previous episodes of GvHD, in correlation with clinical symptoms and therapy refractoriness [12, 15], but not a sign of acute GvHD.

Lack of epithelial cells on the mucosal surface was considered essential to conclude mucosal denudation. In addition, deposition of fibrin or accumulation of inflammatory cells and/or widening of small blood vessels was required to avoid confusion with artificial damage (Figs. 5 and 6).
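The three definitions above can be read as simple decision rules. The following Python sketch encodes only the criteria stated in this section; the class, field, and function names are hypothetical, and the complete and authoritative list remains Table 3.

```python
from dataclasses import dataclass


@dataclass
class CryptCellFinding:
    """Hypothetical record of one candidate apoptotic finding."""
    in_crypt_epithelium: bool          # False for surface epithelium or lamina propria
    cell_shrinkage: bool = False
    condensed_chromatin: bool = False
    cytoplasmic_eosinophilia: bool = False
    debris_particles_in_area: int = 0  # particles of nuclear debris in one area
    debris_phagocytosis: bool = False


def is_diagnostic_apoptosis(f: CryptCellFinding) -> bool:
    """Apoptosis counts only in crypt epithelium and needs the full classical
    triad, or at least two debris particles in one area, or phagocytosed debris."""
    if not f.in_crypt_epithelium:
        return False
    classical = (f.cell_shrinkage and f.condensed_chromatin
                 and f.cytoplasmic_eosinophilia)
    alternative = f.debris_particles_in_area >= 2 or f.debris_phagocytosis
    return classical or alternative


def is_crypt_destruction(destroyed_fraction: float) -> bool:
    """At least one third of the crypt circumference must be destroyed."""
    return destroyed_fraction >= 1 / 3


def is_mucosal_denudation(surface_epithelium_absent: bool, fibrin: bool = False,
                          inflammatory_cells: bool = False,
                          widened_vessels: bool = False) -> bool:
    """Epithelial loss plus at least one reactive sign, to exclude artifacts."""
    reactive = fibrin or inflammatory_cells or widened_vessels
    return surface_epithelium_absent and reactive


# An isolated shrunken cell is not sufficient; two debris particles are.
print(is_diagnostic_apoptosis(CryptCellFinding(True, cell_shrinkage=True)))
print(is_diagnostic_apoptosis(CryptCellFinding(True, debris_particles_in_area=2)))
```

Satellitosis is deliberately left out of the record, since it was not considered necessary for establishing apoptosis.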

A complete list of histological consensus criteria is given in Table 3.

Interobserver agreement

First round robin test

Before the consensus, all 5 observers agreed on the diagnosis of GvHD in 24 of the 33 cases (positive 23, negative 1) (Table 1). In 4 of the remaining 9 cases, there was one divergent interpretation, while in the other 5 cases, 2 pathologists rendered a divergent diagnosis. The discordances were related to concerns about focal or subtle changes (4 cases), the possibility of concurrent infections (CMV in one case, EBV with additional PTLD in one case), therapy-related toxic changes (2 cases), or advanced tissue damage without determinable cause (1 case). The agreement was 84 % (kappa 0.347; Gwet’s AC1, 0.792) for the diagnosis of GvHD.

A divergence in grading of acute GvHD occurred in 20 of the 24 cases diagnosed as GvHD by all participants. Differences of one grade occurred in 16, of 2 grades in 2, and of 3 grades also in 2 cases. Divergence was related to possible CMV infection in 7 of these cases and to therapy-related toxic changes in one. The agreement for Lerner grading was 48 % (kappa 0.322; Gwet’s AC1, 0.323). When the grading categories were taken together as low grade (Lerner grade 1 and 2) or high grade (grade 3 and 4), the agreement increased to 68 % (kappa 0.457; Gwet’s AC1, 0.552).
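Why taking grades together raises agreement can be illustrated with a small Python sketch. The grade ratings are hypothetical, not study data, and the pairwise agreement function is an assumed stand-in for the percentage agreement reported here.

```python
from collections import Counter

# Grouping used in this study: Lerner grades 1-2 are low grade, 3-4 high grade.
GRADE_GROUP = {1: "low", 2: "low", 3: "high", 4: "high"}


def collapse(grades):
    """Map each Lerner grade to its low/high group."""
    return [GRADE_GROUP[g] for g in grades]


def pairwise_agreement(labels):
    """Share of rater pairs that assigned the same label to one case."""
    n = len(labels)
    same_pairs = sum(c * (c - 1) for c in Counter(labels).values())
    return same_pairs / (n * (n - 1))


# Hypothetical case: five observers split between grade 1 and grade 2.
case_grades = [1, 2, 2, 1, 2]

print(pairwise_agreement(case_grades))            # agreement on the 4-level scale
print(pairwise_agreement(collapse(case_grades)))  # agreement after grouping
```

Differences of one grade within the same group (1 versus 2, or 3 versus 4) disappear after grouping, whereas a 2 versus 3 split remains a disagreement.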

Second round robin test

After the written consensus, agreement on the diagnosis of GvHD was reached in 25 of the 33 cases (positive 23, negative 2). Of the remaining 8 divergent cases, 2 had one divergent diagnosis, while the remaining 6 had 2 divergent diagnoses. Divergent diagnoses were due to focal or subtle disease (2 cases), the possibility of concurrent infections (CMV in one case, EBV with additional PTLD in one case), therapy-related toxic changes (2 cases), or advanced tissue damage without determinable cause (1 case). The agreement for a diagnosis of GvHD increased only marginally to 85 % (kappa 0.396; Gwet’s AC1, 0.805).

As to grading of the 25 cases diagnosed as GvHD, the observers agreed on the grade in 10 cases but diverged in 13, with differences of one grade in 8, of 2 grades in 3, and of 3 grades in 2 cases. In 6 of these cases, a possibility of CMV coinfection and, in one case, toxic changes were considered. The interobserver agreement for grading increased to 60 % (kappa 0.455; Gwet’s AC1, 0.512), and by clustering the grading categories into low grade and high grade, the agreement increased to 74 % (kappa 0.55; Gwet’s AC1, 0.634).

Discussion

Shortcomings of established grading schemes such as Lerner histologic grading [10] and related modifications [14, 20] have been pointed out in earlier [15] and recent [16] consensus documents dealing with the histopathological diagnosis of gastrointestinal GvHD. Firstly, diagnostic criteria for a minimal degree of GvHD are not standardized and are subject to individual variation. Secondly, tissue damage evolves, and during the course of disease, different steps may accumulate. When a biopsy is taken at the onset of gut GvHD, the diagnosis may be made when only a few apoptotic cells are found, without marked mucosal crypt damage or loss. In our study, Lerner grading included threshold signs as well as cell damage accumulated in later stages of the disease.

The initial histological evaluation resulted in substantial disagreement on diagnosis and grading of GvHD. This was partly due to a lack of precision in defining apoptosis, crypt destruction, and mucosal denudation. With the additional input from a larger panel of reviewers, we developed consensus definitions of essential histological features which constitute the basis for a diagnosis of gut GvHD. With these consensus diagnostic criteria, a measurable improvement in the reproducibility of diagnosis (negative or positive for GvHD) and grading (particularly of low- and high-grade GvHD) was attained. This is clinically relevant as early GvHD is not easily detectable by endoscopy [9], which leaves open other causes of diarrhea such as infection or therapy-related toxicity.

Consensus on criteria for a histological diagnosis of GvHD is particularly important when results from different institutions are compared, for example quantification (in order to establish a threshold for diagnosis) and grading [13]. Although correlation with the clinical course, which was not included in our study, would be necessary to validate diagnostic criteria, most participants considered up to two apoptotic cells per 4 mm² detected at a magnification of ×100 (using a ×10 eyepiece with a ×10 objective) as sufficient for a diagnosis of low-grade GvHD.

Consensus criteria also improved reproducibility of GvHD grading, from 48 to 60 %. Clustering of the four Lerner grades into low (grades 1 and 2) and high grade (grades 3 and 4) increased the agreement from 68 % (first round) to 74 % (second round). Using the Fleiss kappa for computing the interobserver agreement, we faced the “first kappa paradox” due to the high percentage of GvHD diagnoses in our biopsies [6]. The kappa value was relatively low, especially for GvHD versus no GvHD. Gwet’s AC1 values are less affected by prevalence and marginal probability and therefore provide a more stable reliability coefficient [7, 22]. We found Gwet’s AC1 values to be closer to the percentage of agreement.
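The prevalence effect behind this paradox can be shown with a short Python sketch for the two-category case (GvHD versus no GvHD). The observed agreement of 0.85 and the prevalence values are illustrative assumptions, not results of this study, and the function names are hypothetical. The chance-agreement terms are the standard ones: p² + (1 − p)² for kappa and 2p(1 − p) for AC1, where p is the share of "GvHD" ratings.

```python
def kappa_two_categories(observed, prevalence):
    """Kappa with chance agreement p^2 + (1 - p)^2 (two categories)."""
    chance = prevalence ** 2 + (1 - prevalence) ** 2
    return (observed - chance) / (1 - chance)


def ac1_two_categories(observed, prevalence):
    """Gwet's AC1 with chance agreement 2 * p * (1 - p) (two categories)."""
    chance = 2 * prevalence * (1 - prevalence)
    return (observed - chance) / (1 - chance)


# Same observed agreement, increasingly skewed share of "GvHD" ratings.
OBSERVED = 0.85  # illustrative value only
for p in (0.5, 0.7, 0.9):
    k = kappa_two_categories(OBSERVED, p)
    a = ac1_two_categories(OBSERVED, p)
    print(f"prevalence={p:.1f}  kappa={k:.3f}  AC1={a:.3f}")
```

At a balanced prevalence of 0.5 both coefficients equal 0.70. At a prevalence of 0.9, kappa falls to about 0.17 while AC1 rises to about 0.82, although the observed agreement is unchanged.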

The remaining diagnostic divergences were mostly due to unresolved differential diagnoses, especially viral infection (such as CMV and adenovirus) and therapy-related toxicity, which may produce histological features similar to those of GvHD [11, 18]. This underscores the need for microbiological and immunohistological studies. It should be borne in mind that immunohistochemical evidence of a CMV infection does not exclude GvHD because GvHD, infection, and therapy-related toxicity are not mutually exclusive.

A potential limitation of our study is selection bias, as the included cases were neither prospectively nor randomly collected. Participants selected diagnostically difficult cases, which they considered of particular interest based on their experience, including biopsies not diagnostic for GvHD and various grades of acute GvHD. The kappa values for interobserver agreement should therefore not be regarded as a benchmark for a diagnostic standard, which would have to be established on a random sample or consecutive case series.

In summary, we have developed a set of criteria that define histological features of gut GvHD, notably apoptosis, crypt destruction, and mucosal loss. These can be applied to any subsequent study. When evaluating samples with minimal changes, perfect agreement is difficult to attain. We show that consensus diagnostic histological criteria improve interindividual reproducibility. Close interaction between pathologists and attending clinicians remains a cornerstone for adequate interpretation of histopathological observations, in providing arguments to eliminate critical differential diagnoses.