Introduction

Diagnosis and research in colorectal cancer (CRC) routinely rely on the interpretation of protein expression detected by means of immunohistochemistry (IHC). The lack of standardized IHC scoring systems has led to a variety of unvalidated methods used to assess immunoreactivity. Scoring systems for tumor markers in CRC are usually based on a measure of the proportion of positive tumor cells and are often combined with a degree of staining intensity [3, 9, 12, 29]. It is recognized that the interpretation of staining intensity is not only highly subjective but may be affected by storage time, variation in protocols, and fixation procedures [2, 13]. Despite these concerns, staining intensity has become an integral component of many IHC scoring methods for tumor markers in CRC [7, 8, 10, 12, 23, 25–27].

We have recently shown that a descriptive, semiquantitative scoring system for IHC based on the percentage of immunoreactive tumor cells (percent positivity) provides a more complete assessment of the predictive and prognostic value of several tumor markers in CRC when compared to an evaluation system using “negative/positive” [16–19]. Additionally, we have studied the interobserver variability of this semiquantitative method among pathologists for several proteins, namely epidermal growth factor receptor (EGFR), vascular endothelial growth factor, p53, Bcl-2, and apoptotic protease-activating factor 1 (APAF-1), in tumor biopsies and tissue microarray (TMA) punches and have shown that it is reproducible [30, 31, 34].

One of the most important advantages of a semiquantitative scoring method is that it allows the investigator to establish more biologically or clinically relevant cutoff scores for positivity for the protein and the outcome under study rather than relying on an often arbitrary threshold value, such as 10%, to describe a tumor as “positive.” Such a method to ascertain cutoff scores has been recently proposed using receiver operating characteristic (ROC) curve analysis and has been applied, along with several other tumor markers, to EGFR and APAF-1 [32–34].

The purpose of this study was to determine whether staining intensity in conjunction with percent positivity should be used as an indicator of protein expression detected by IHC. The associations of p53-, Her2/neu-, EGFR-, adenomatous polyposis coli (APC)-, and β-catenin-staining intensity with a range of clinico-pathological features, notably T stage, N stage, tumor grade, vascular invasion, and survival, were evaluated in 1,197 mismatch repair (MMR)-proficient CRCs.

Materials and methods

Tissue microarray construction

A TMA of 1,420 unselected, nonconsecutive CRCs was constructed [24]. Briefly, formalin-fixed, paraffin-embedded tissue blocks of CRC resections were obtained. One tissue cylinder with a diameter of 0.6 mm was punched from morphologically representative tissue areas of each donor tissue block and brought into one recipient paraffin block (3 × 2.5 cm) using a homemade semiautomated tissue arrayer. The resulting TMA set comprised three slides.

Clinico-pathological data

The clinico-pathological data for all patients included T stage (T1, T2, T3, and T4), N stage (N0, N1, and N2), tumor grade (G1, G2, and G3), vascular invasion (presence or absence), and survival time [19].

Immunohistochemistry

Four-micrometer sections of TMA blocks were transferred to an adhesive-coated slide system (Instrumedics, Hackensack, NJ). Standard indirect immunoperoxidase procedures were used for IHC (ABC-Elite, Vector Laboratories, Burlingame, CA). One thousand four hundred and twenty CRCs were immunostained for mutL homolog (MLH)1 (clone MLH-1; dilution 1:100; BD Biosciences Pharmingen, San Jose, CA), MSH2 (clone MSH-2; dilution 1:200; BD Biosciences Pharmingen, San Jose, CA), and MSH6 (clone 44; dilution 1:400; Transduction Laboratories). After dewaxing and rehydration in deionized H2O, sections were subjected to heat antigen retrieval in a microwave oven (1,200 W, 15 min) in 0.001 mol/L ethylenediamine tetraacetic acid at pH 8.0 for MLH1 and MSH2 and in 0.01 mol/L citrate buffer (pH 7.0) for MSH6. Endogenous peroxidase activity was blocked using 0.5% H2O2. The sections were incubated with 10% normal goat serum (Dako Cytomation, Mississauga, Canada) for 20 min and then with a primary antibody at room temperature (Her2/neu clone PN2A, DAKO, Denmark; p53 clone DO-7, dilution 1:100, DAKO; β-catenin clone β-catenin-1, dilution 1:200, Dako Cytomation; APC clone C20, dilution 1:50, Santa Cruz, CA). Subsequently, sections were incubated with secondary antibody (K4005, EnVision+ System-HRP (AEC); Dako Cytomation) for 30 min at room temperature. For visualization of the antigen, the sections were immersed in 3-amino-9-ethylcarbazole + substrate-chromogen (K4005, EnVision+ System-HRP (AEC); Dako Cytomation) for 30 min and counterstained with Gill’s hematoxylin. IHC for EGFR was performed using an automated stainer (EGFR clone 3C6, 3 mg/mL; Ventana Medical Systems, Tucson, USA) according to the manufacturer’s instructions. TMA slides for each protein were stained on the same day under identical conditions.

MMR results

The 1,420 CRCs were stratified according to MMR status: (1) MMR-proficient tumors expressing MLH1, MSH2, and MSH6, (2) MLH1-negative tumors, and (3) presumed hereditary nonpolyposis colorectal cancer (HNPCC) cases demonstrating loss of MSH2 and/or MSH6 at any age or loss of MLH1 at age less than 55 years [11]. In population-based studies, the mean age of diagnosis of Lynch syndrome is around 55 years. By contrast, sporadic high-level microsatellite instability (MSI-H; MMR-deficient) CRC is far more age-related with a mean age of onset of around 75 years and few cases occurring younger than 60 years. A cutoff of 55 years was set as a reasonable compromise for distinguishing CRCs with loss of MLH1 into likely HNPCC syndrome vs likely sporadic. These immunohistochemical groupings showed a good fit with the known clinicopathological features associated with these subsets of CRC. Particularly, the MLH1-negative group was associated with advanced age, predilection for women and the proximal colon, large tumor size, and poor differentiation. The presumed HNPCC group was young and showed no gender difference, and there was a predilection for the proximal colon as compared with the MMR-proficient group. While it is possible that a small proportion of presumed sporadic MSI-H and HNPCC cases were incorrectly assigned, the overall findings are likely to be valid in view of the large numbers of samples and the good fit with clinico-pathological features. Only MMR-proficient tumors were included in the study to ensure a homogeneous sample of tumors (N = 1,197, 84.4%).
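The stratification rule described above can be written as a small decision function. The following Python sketch is illustrative only and is not the authors' code: the function name `mmr_group`, its argument names, and the precedence given to MSH2/MSH6 loss when MLH1 is also lost are assumptions. Positivity is taken to mean any nuclear staining (>0%), as stated in the IHC evaluation.

```python
def mmr_group(mlh1_percent, msh2_percent, msh6_percent, age_years):
    """Assign a tumor to one of the three MMR groups described in the text.

    Each *_percent argument is the percentage of tumor nuclei staining for
    that protein; any value above 0 counts as expression (assumed inputs).
    """
    mlh1_pos = mlh1_percent > 0
    msh2_pos = msh2_percent > 0
    msh6_pos = msh6_percent > 0

    # Group 1: all three proteins expressed.
    if mlh1_pos and msh2_pos and msh6_pos:
        return "MMR-proficient"

    # Group 3: loss of MSH2 and/or MSH6 at any age.
    # Assumption: this rule takes precedence if MLH1 is lost as well.
    if not msh2_pos or not msh6_pos:
        return "presumed HNPCC"

    # Only MLH1 is lost: the 55-year cutoff separates likely hereditary
    # cases (group 3) from likely sporadic ones (group 2).
    if age_years < 55:
        return "presumed HNPCC"
    return "MLH1-negative"


# Example: MLH1 loss alone in a 72-year-old patient.
print(mmr_group(0, 80, 90, 72))
```

Only tumors returned as "MMR-proficient" by such a rule would enter the analyses reported here.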

IHC evaluation

Immunoreactivity was evaluated in all 1,420 punches by one experienced pathologist. TMA punches with insufficient tissue or tumor for analysis were excluded from the study. Protein expression was scored in the nucleus for p53 and β-catenin and in the cytoplasm for APC. EGFR and Her2/neu positivity were scored in both cell membrane and cytoplasm. Immunoreactivity was assigned a score based on the proportion of positive tumor cells over total tumor cells (percent positivity) ranging from 0 to 100%. Staining intensity was evaluated as 0 = negative, 1 = weak, 2 = moderate, and 3 = strong. If the staining intensity was heterogeneous, then scoring was based on the greatest degree of intensity. MLH1, MSH2, and MSH6 were scored in the nucleus as negative (0%) and as positive (>0%).

Statistical analysis

Interobserver reproducibility of scoring percent positivity and staining intensity

To determine the interobserver reproducibility of percent positivity and staining intensity, a minimum of 100 CRC punches was evaluated by a second pathologist. The intraclass correlation coefficient (ICC) defined as the ratio of the between-subject variance over the between-subject + within-subject variances was used to determine the reliability of percent positivity for each protein. The ICC has previously been used to assess the agreement of IHC scores [14, 31]. An ICC of 0.7 or greater is considered sufficient to establish reproducibility [14]. The interobserver agreement of staining intensity (negative, weak, moderate, and strong) was determined using the kappa coefficient (κ) [15]. The overall κ coefficient measures the reliability of categorical data while taking into account the probability that both observers achieved the same scores by chance [1]. The weighted κ may be used as a measure of inter-rater agreement for ordinal variables and quantifies the relative difference between them. The greater the difference is between the scores, the lower the weighted κ. The interpretation of κ is commonly made as follows: values between 0.81 and 1.0 represent “almost perfect” agreement, 0.61 and 0.80 “substantial” agreement, 0.41 and 0.60 “moderate” agreement, 0.21 and 0.40 “fair” agreement, and 0 and 0.20 “slight” agreement [15].
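To make the two agreement measures concrete, the sketch below computes an ICC for percent positivity and the overall and weighted κ for staining intensity from paired scores of two observers. It is a minimal illustration, not the SAS procedure used in the study. The one-way ANOVA estimator of the ICC and the linear weighting scheme for κ are assumptions, as the text does not specify either, and the function names are hypothetical.

```python
from collections import Counter


def icc_oneway(scores_a, scores_b):
    """ICC for two raters from a one-way ANOVA (assumed estimator).

    Estimates between-subject variance / (between + within variance).
    """
    n = len(scores_a)
    k = 2  # number of raters
    subject_means = [(a + b) / k for a, b in zip(scores_a, scores_b)]
    grand_mean = sum(subject_means) / n
    # Mean squares between and within subjects.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(scores_a, scores_b, subject_means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def kappa(ratings_a, ratings_b, categories=(0, 1, 2, 3), weighted=False):
    """Cohen's kappa; with weighted=True, linear agreement weights (assumed)."""
    n = len(ratings_a)
    span = len(categories) - 1
    index = {c: i for i, c in enumerate(categories)}

    def weight(x, y):
        if not weighted:
            return 1.0 if x == y else 0.0
        # Partial credit shrinks as the two scores move further apart.
        return 1.0 - abs(index[x] - index[y]) / span

    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    observed = sum(weight(a, b) for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each observer's marginal frequencies.
    expected = sum(weight(x, y) * count_a[x] * count_b[y]
                   for x in categories for y in categories) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical intensity scores (0-3) from two observers on six punches.
obs1 = [0, 1, 2, 3, 1, 2]
obs2 = [0, 2, 2, 3, 1, 1]
print(kappa(obs1, obs2), kappa(obs1, obs2, weighted=True))
```

In this toy example the two disagreements differ by only one intensity grade, so the weighted κ is higher than the overall κ, which treats every disagreement as complete.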

Univariate analysis

Logistic regression analysis was used to determine the association of percent positivity or staining intensity with T stage (early = T1 + T2/late = T3 + T4), N stage (absence [N0] or presence [>N0] of lymph node involvement), tumor grade (low = G1 + G2/high = G3), and vascular invasion (absence or presence). Survival analysis was carried out using Cox proportional hazards regression.
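As an illustration of the univariate approach, the sketch below dichotomizes T stage as described and fits a single-predictor logistic regression by Newton-Raphson. It is a stand-in for the SAS analysis rather than a reproduction of it. The function names and the toy data are hypothetical, and no Wald test or P value is computed.

```python
import math


def late_t_stage(t_stage):
    """Code T stage as in the text: early (T1, T2) = 0, late (T3, T4) = 1."""
    return 1 if t_stage in ("T3", "T4") else 0


def fit_logistic(x, y, iterations=25):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), 2 x 2 and symmetric.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def odds_ratio(b1, step=1.0):
    """Odds ratio for an increase of `step` units in the predictor."""
    return math.exp(b1 * step)


# Toy data: a binary marker (0 = negative, 1 = positive) against T stage.
marker = [0, 0, 0, 0, 1, 1, 1, 1]
stages = ["T1", "T2", "T2", "T3", "T2", "T3", "T4", "T3"]
outcome = [late_t_stage(s) for s in stages]
b0, b1 = fit_logistic(marker, outcome)
print(odds_ratio(b1))
```

For a continuous predictor such as percent positivity, rescaling (for example, per 10% increase) before fitting keeps the Newton steps numerically well behaved; that rescaling is a practical suggestion and is not taken from the study.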

Multivariate analysis

Staining intensity and percent positivity were entered into a multivariate logistic regression model for all binary outcomes, whereas multiple Cox proportional hazards regression was performed for survival analysis. Adjusted P values for percent positivity and staining intensity were obtained.

P values less than or equal to 0.05 indicate a significant association of percent positivity or staining intensity with the outcome. All analyses were carried out using SAS version 9.1 (The SAS Institute, Cary, NC).

Results

Interobserver reproducibility of percent positivity and staining intensity

The reproducibility of scores expressed as percent positivity was very strong for p53 (ICC = 0.91). The ICCs for APC (ICC = 0.85), β-catenin (ICC = 0.78), and membranous Her2/neu (ICC = 0.71) suggest excellent consistency of scores between observers. The interobserver agreement for cytoplasmic Her2/neu scores was only slightly lower than for its membranous counterpart (ICC = 0.68). However, the reliability of membranous and cytoplasmic EGFR expression was only moderate to low. The same observers independently re-evaluated EGFR expression a second time by scoring the number of immunoreactive tumor cells without regard to localization of staining (i.e., by scoring either membrane and/or cytoplasmic EGFR). The interobserver agreement was significantly increased (ICC = 0.86) [33].

The agreement of staining intensity between observers using the overall κ coefficient was only moderate for APC (κ = 0.41) and membranous and cytoplasmic Her2/neu (κ = 0.53 and 0.57, respectively). The reproducibility of staining intensity for β-catenin was determined to be fair (κ = 0.34) while that of membranous or cytoplasmic EGFR was poor (κ = 0.11 and 0.12, respectively). Analyses with the more generous weighted κ did little to improve these findings (Fig. 1; Table 1).

Fig. 1 a–l Weak (left), moderate (center), and strong (right) staining intensity of nuclear p53 (40×; a–c), membranous EGFR (40×; d–f), cytoplasmic EGFR (40×; g–i), and membranous Her2/neu (40×; j–l). m–u Weak (left), moderate (center), and strong (right) staining intensity of cytoplasmic Her2/neu (40×; m–o), cytoplasmic APC (40×; p–r), and nuclear β-catenin (40×; s–u)

Table 1 Interobserver agreement for percent positivity (percent scores) measured by the intraclass correlation coefficient (ICC) and for staining intensity measured by the overall and weighted kappa coefficient (κ)

Association of protein expression with clinico-pathological features

Univariate analysis of percent scores

The evaluation of percent positivity with the clinico-pathological features identified significant associations between p53 and T stage (P = 0.007) and tumor grade (P = 0.005), membranous EGFR expression and T stage (P = 0.005), N stage (P = 0.002), tumor grade (P = 0.014), and survival (P < 0.001), and cytoplasmic EGFR expression and survival (P = 0.01). APC expression was correlated with T stage (P = 0.023) and β-catenin expression with tumor grade (P = 0.035), vascular invasion (P = 0.008), and survival (P = 0.004). There were no associations of Her2/neu with any of the clinico-pathological features (Table 2).

Table 2 Association of protein expressed as percent positivity with clinico-pathological features by univariate analysis (P value)

Univariate analysis of staining intensity

All eight significant associations of staining intensity with the clinico-pathological features had already been identified by percent positivity. Three associations found with percent positivity were not detected with staining intensity: membranous EGFR with tumor grade, and β-catenin with vascular invasion and with survival (Table 3).

Table 3 Association of staining intensity with clinico-pathological features by univariate analysis (P value)

Multivariate analysis of percent positivity and staining intensity

The combined analysis of percent positivity with staining intensity identified five associations between the proteins and clinico-pathological features that were previously found using only the percentage of positive cells. However, the remaining six associations found to have statistical significance in univariate analysis of percent positivity (p53 with T stage, membranous EGFR with T stage, N stage, and tumor grade, APC with T stage, and β-catenin with tumor grade) were no longer observed in combination with the degree of intensity. In only 1 of the 35 analyses (2.9%) did staining intensity provide additional information about the association of the protein with the outcome (β-catenin with N stage; Table 4).

Table 4 Association of IHC expressed as percent positivity and clinico-pathological features in multivariate analysis with staining intensity (adjusted P values)

Discussion

The results of this study confirm that the evaluation of percent positivity for nuclear p53, cytoplasmic APC, nuclear β-catenin, and membranous and cytoplasmic Her2/neu expression is highly reproducible among pathologists. The assessment of EGFR expression resulted in strong interobserver agreement when cytoplasmic and/or membranous immunoreactivity were scored together rather than in their separate localizations. The intensity of staining was not reproducible for the proteins in this study.

In the univariate analysis, protein expression assessed as percent positivity resulted in 11 significant associations between the proteins and clinico-pathological features. Eight of these 11 were also demonstrated using only the degree of staining intensity. However, more than half of the associations identified by percent positivity alone were lost when staining intensity was also analyzed.

Scoring systems for tumor markers in CRC are typically based on some measure of the number of positive tumor cells and often combined with a degree of staining intensity. However, Atkins et al. [2] demonstrated using an anti-EGFR antibody in head and neck cancer, non-small cell lung carcinomas, and colorectal adenocarcinoma that the degree of staining intensity varied by tumor type, was partially influenced by the choice of fixatives, and was inversely correlated with storage time of the unstained tissue sections. These factors in addition to the variation in IHC protocols inevitably contribute to the subjective nature of staining intensity. Contradictory results from different reports on the same tumor markers may be partially explained by this subjective assessment of immunoreactivity [23, 26].

We have previously demonstrated that a descriptive, semiquantitative scoring system based on the percentage of positive tumor cells (percent positivity) is reproducible and has several advantages over standard scoring methods based on predetermined cutoff scores [17–19]. First, this scoring system allows a more thorough assessment of the predictive or prognostic significance of tumor markers by evaluating the entire range of protein expression levels (from 0 to 100%). Moreover, by quantifying protein expression at the outset, more biologically and clinically relevant cutoff scores for tumor positivity can be established by, for example, performing ROC curve analysis [34]. This method has been used to select cutoff scores for tumor markers macrophage stimulating factor 1, Raf-1 kinase inhibitor protein, receptor of hyaluronic acid-mediated motility (RHAMM), APAF-1, EGFR, as well as for several others involved in transforming growth factor β signaling in CRC [4, 20, 21, 32, 33]. Additionally, the correlations between various proteins can be assessed. We have recently shown using this scoring method that the percentage of pERK-positive tumor cells is strongly associated with increases in RHAMM expression supporting the hypothesis of a RHAMM–mitogen-activated protein kinase interaction in MMR-proficient CRC [17]. By percent scoring, we have also described how classification and regression tree methods could be used to select proteins playing a role in predicting rectal tumor response to preoperative radiotherapy [30]. Finally, this descriptive scoring method avoids an often complex and interpretative composite scoring system based on the intensity of staining. One such method includes a four-tier scoring of the intensity of staining (0, 1+, 2+, 3+) coupled to either the mean percentage of positive tumor cells or to a categorical measure of the percentage of positive tumor cells (for example, 1–10%, 10–50%, and >50%) [3, 6, 9, 12, 29].
A graded scoring system has also been used where the percentage of positive tumor cells is categorized (0 = no positivity, 1 = 1–25%, 2 = 25–50%, 3 = >50%) and multiplied by the degree of intensity (0, 1, 2, 3) to obtain a score that is then dichotomized into “low” or “high” expression (low = score < 6 and high = score ≥ 6) [26]. Others have reported only the degree of staining intensity regardless of the proportion of immunoreactivity or considered only staining intensities of 2+ or 3+ as positive for protein expression [24].
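The graded composite score just described can be sketched in a few lines of Python to show how it collapses the underlying measurements. This is an illustration of the published scheme as summarized above, not code from any of the cited studies. The function names are hypothetical, and the handling of the shared category boundaries (25% and 50% assigned to the lower category, any value above 0% counted as category 1) is an assumption, since the stated ranges overlap.

```python
def proportion_category(percent):
    """Map percent positivity (0-100) to the graded category 0-3."""
    if percent <= 0:
        return 0   # no positivity
    if percent <= 25:
        return 1   # 1-25% (boundary handling assumed)
    if percent <= 50:
        return 2   # 25-50%
    return 3       # >50%


def composite_score(percent, intensity):
    """Category of positive cells multiplied by staining intensity (0-3)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent positivity must lie between 0 and 100")
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0, 1, 2, or 3")
    return proportion_category(percent) * intensity


def expression_level(percent, intensity):
    """Dichotomize the composite score: low = score < 6, high = score >= 6."""
    return "high" if composite_score(percent, intensity) >= 6 else "low"


# A tumor with every cell weakly positive is called "low", whereas one with
# 40% strongly positive cells is called "high".
print(expression_level(100, 1), expression_level(40, 3))
```

Under this scheme the final label depends as much on the intensity grade as on the proportion of positive cells, so any interobserver disagreement in intensity carries directly into the dichotomized result.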

The purpose of this study was not to evaluate the prognostic significance of several tumor markers in CRC but rather to determine whether staining intensity is a useful indicator of immunoreactivity in colorectal tumors. In addition, the study focused on whether staining intensity provides independent information on the association of the protein with clinico-pathological features beyond that which can be obtained from the semiquantitative assessment of immunoreactivity. The markers included in this study are well established and/or of current interest as prognostic factors. They were selected to provide a range of subcellular localizations for scoring purposes (cytoplasm, cell membrane, nucleus) as well as representing both tumor suppressors (p53 and APC) and oncogenes (β-catenin, Her2/neu, and EGFR).

TMA technology allowed us to analyze more than 1,000 CRCs using only three slides. One tissue sample (0.6 mm) per tumor was obtained. Although it is argued that a single tissue core may not be representative of the whole tumor, results using one sample appear to approximate those from larger tissue sections as more samples are analyzed. In fact, even larger tissue sections may contain only a small fraction of the entire tumor mass (1/10,000) [24]. Goethals et al. [9] reported that four core biopsies are sufficient to account for tumor heterogeneity. Because the inclusion of several punches per tumor is not always possible, a larger series of tumors should compensate for tumor heterogeneity as was the case in this study. Several studies have shown well-established associations between molecular features and clinico-pathological endpoints in TMAs using only one spot per tumor [5, 22, 28]. Most importantly, evaluating a single tumor punch may lead to a more reliable analysis of interobserver agreement for both percent positivity and staining intensity, as precisely the same area of tumor is scored by each observer [24].

The results of our study suggest that staining intensity is not an independent measure of protein expression for the markers in this study. Additionally, the evaluation of immunoreactivity using a semiquantitative scoring method appears to be sufficient for establishing associations of the selected tumor markers with most clinico-pathological features.