Introduction

Proteins overexpressed on the surface of tumor cells can be selectively targeted. Epidermal growth factor receptor (EGFR) and human epidermal growth factor receptor 2 (HER2) are among the most commonly targeted proteins in cancer therapy. Accurate determination of HER2 or EGFR expression is essential for the prediction of a cancer patient’s response to therapy and prognosis (Montemurro and Scaltriti 2014; Fusco et al. 2013). Reliable detection of biomarkers is essential for making early, rational, and correct clinical decisions. For HER2 evaluation in particular, the ability to reliably identify patients with breast cancer or gastric cancer who might benefit from trastuzumab treatment is important not only for clinical reasons (high proportion of cardiotoxicity) but also for economic ones (about 50,000–60,000 euros per treatment/quality-adjusted life year). The definition of HER2-positive status affects the clinical- and cost-effectiveness of trastuzumab (Spackman et al. 2013; Norman et al. 2011). Manual scoring is routinely used to assess protein expression based on staining intensities and distribution patterns. However, this approach is subjective and prone to significant intra- and interobserver variability that limits reproducibility and statistical confidence (Braun et al. 2013; Dobson et al. 2010; Mulrane et al. 2008). Developed in response to this disadvantage, digital image analysis techniques offer an objective approach and the possibility of increased sensitivity and improved accuracy (Mohammed et al. 2012b; Dobson et al. 2010).

In this study, we analyzed five membrane-binding biomarkers: HER2, which is clinically approved for therapy response prediction, and EGFR, pEGFR, β-catenin, and E-cadherin, which are used in tissue-based research. We evaluated these five biomarkers on tissue microarrays (TMAs) containing tumor tissue from 153 esophageal adenocarcinoma patients. We compared immunohistochemistry (IHC) evaluation by digital image analysis with visual scoring of these five biomarkers. Kaplan–Meier survival analysis, with disease-free and overall survival as clinical end points, was used to determine and compare the prognostic significance of the two approaches.

Materials and methods

Patient selection and tissue samples

All patients (n = 153) underwent primary surgical resection between 1995 and 2005 without chemotherapy or radiotherapy at the Department of Surgery, Klinikum Rechts der Isar, Technische Universität München. All cases had adenocarcinomas of the distal esophagus (Barrett’s cancer) associated with histopathologically identified Barrett’s esophagus (according to WHO 2000 criteria). Data were acquired after receiving approval from the ethics committee of the Technische Universität München. Survival analyses were performed for all 153 patients. The mean age at the time of surgery was 64 years, and the maximum follow-up time was 164 months (median disease-free survival time = 31 months; median overall survival time = 33 months). The clinicopathologic characteristics of the collective are presented in Table 1.

Table 1 Clinicopathologic parameters of the collective

Immunohistochemistry

Preparation of TMAs was performed as previously described, generating triplicate cores with a diameter of 1.0 mm each (Berezowska et al. 2013). Serial tissue sections were cut (5 µm) and transferred to slides (Rauser et al. 2007). The sections were incubated with antibodies specific for β-catenin (1:500, BD Bioscience, Heidelberg, Germany), E-cadherin (1:1,500, BD Bioscience, Heidelberg, Germany), pEGFR (Y1086) (1:100, Invitrogen, Karlsruhe, Germany), EGFR (pharmDx™-kit, DAKO, Hamburg, Germany), and HER2 (HercepTest kit, DAKO, Hamburg, Germany). Immunohistochemical staining was performed using a Discovery XT automated stainer (Ventana, Tucson, AZ, USA). Positive and negative controls were included in each staining procedure.

Only cores with technically unequivocal staining results and with sufficient tumor content (>50 tumor cells) were used for visual scoring and image analysis.

Visual scoring

HER2 expression was assessed according to the published guidelines for routine HER2 evaluation, and each sample was classified as 0, 1+, 2+, or 3+ (Wolff et al. 2013; Ruschoff et al. 2010; Hofmann et al. 2008). The immunohistological expression of the other membrane-binding proteins (i.e., EGFR, β-catenin, E-cadherin, and pEGFR) was classified into four levels analogous to the HER2 evaluation (Fig. 1). Cytoplasmic staining was considered a non-specific result and was not included in the scoring. The evaluation was performed by two independent pathologists (RL and AW). In the case of a disagreement, a consensus was reached by discussion.

Fig. 1 Results of β-catenin as an example of immunohistochemical staining on esophageal adenocarcinoma tissues. a Visual score 0, b visual score 1+, c visual score 2+, d visual score 3+

Digital image analysis

All stained slides were scanned at 20× objective magnification using a Mirax Desk digital slide scanner (Carl Zeiss MicroImaging, Munich, Germany). Immunohistochemical membrane staining results were quantified using the commercially available image analysis software Definiens Developer XD2 (Definiens AG, Munich, Germany). This software allows the detection and quantification of immunohistochemical staining intensities in different cellular compartments (e.g., membranes) within a user-specified region of interest of the tumor cells. Algorithms were developed and modified specifically for each marker based on semantic and context-based segmentation processes, which include staining intensity, color features, shape, area, and neighborhood. The quantified parameter for EGFR, β-catenin, E-cadherin, pEGFR, and HER2 was a value representing a point on a continuous spectrum of average brown staining intensity in relative units (Fig. 2). A mean value was calculated from triplicate tissue cores from each patient.
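The per-patient averaging step can be illustrated with a minimal sketch in plain Python. The dictionary layout, the patient IDs, and the intensity values are hypothetical, and the handling of excluded cores (averaging whatever evaluable cores remain) is an assumption; the Definiens segmentation workflow itself is not reproduced here.

```python
from statistics import mean


def patient_mean_intensity(core_values):
    """Average membrane staining intensity per patient.

    core_values maps a patient ID to the intensities (relative units) of
    that patient's cores; a core that was excluded is given as None.
    """
    result = {}
    for patient_id, values in core_values.items():
        evaluable = [v for v in values if v is not None]
        if evaluable:  # patients without any evaluable core are skipped
            result[patient_id] = mean(evaluable)
    return result


# Hypothetical example values, not data from the study
cores = {
    "P001": [0.82, 0.78, 0.80],
    "P002": [0.31, None, 0.35],   # one core excluded
    "P003": [None, None, None],   # no evaluable core
}
scores = patient_mean_intensity(cores)
```

In this sketch a patient with fewer than three evaluable cores still receives a score, which may or may not match the rule applied in the study.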

Fig. 2 An example of digital image analysis assessment of strong (a–c) and weak (d–f) β-catenin staining of esophageal adenocarcinoma tissue. a, d Original image; b, e detection of membrane (black), cytoplasm (green), and nuclei (blue); c, f classification of membrane staining intensity (heat map for membrane staining intensity: from weak (yellow) to strong (red))

The results were correlated with patient survival times and compared with the visual scoring results.

Statistical analysis

Disease-free and overall survival rates were calculated using the Kaplan–Meier method and include median and 95 % confidence interval estimates. Survival curves were compared using the log-rank χ² test and Cox proportional hazards regression analysis. In each case, the cutoff point was optimized with respect to the end point. Correlation analyses were performed using the Pearson rank test. Calculations were performed using the statistical data analysis system R (‘stats’ and ‘survival’ procedures; Bell Laboratories, Murray Hill, NJ, USA) and SAS statistical software version 9.2 (SAS Institute Inc, Cary, NC, USA). All tests were two-sided, and P values <0.05 were considered to be statistically significant.
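For readers unfamiliar with the Kaplan–Meier method, the product-limit estimate and the median survival time can be sketched in a few lines of plain Python. This is an illustration only, not the R or SAS code used in the study; the function names and the example follow-up data are hypothetical, and confidence intervals are omitted.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, S(time)) at each event time.

    times  : follow-up in months
    events : 1 if the end point was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = leaving = 0
        # collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed:
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached during follow-up


# Hypothetical follow-up data (months), not data from the study
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
median = median_survival(curve)
```

With these five hypothetical patients the estimate steps down at months 2, 3, and 5, and the median is taken at the first step that reaches 50 % or less.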

Results

Comparison of digital image analysis and visual scoring with pTNM and G status

The results for the correlation analysis of visual scoring and computer-assisted evaluation of IHC expression with pTNM and G status are summarized in Table 2. There was a significant correlation for visual scoring of E-cadherin and pEGFR1086 with pN-status (E-cadherin: P = 0.008, pEGFR: P = 0.0197) and based on image analysis with pT-status (E-cadherin: P = 0.0046, pEGFR: P = 0.0339) and pN-status (E-cadherin: P = 0.0115, pEGFR: P = 0.0125). For EGFR, the results indicated that there was a significant correlation with pT-, pN-, and pM-status based on visual scoring (pT: P = 0.0014, pN: P = 0.0125, pM: P = 0.0076) and digital image analysis (pT: P = 0.0004, pN: P = 0.0284, pM: P = 0.0119). The results for the HER2 analysis indicated that there was a significant correlation between evaluation with digital image analysis and G status.

Table 2 Results for the correlation analysis of visual scoring and computer-assisted evaluation with the pTNM and G status

The level of agreement between visual scoring and the continuous values generated using digital image analysis of the membrane-binding proteins (β-catenin, E-cadherin, pEGFR, EGFR, and HER2) is presented in Table 3. The Pearson rank correlation coefficients were 0.71, 0.68, 0.89, 0.79, and 0.74 for β-catenin, E-cadherin, pEGFR, EGFR, and HER2, respectively (P < 0.0001 for all comparisons).
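How an ordinal visual score can be correlated with a continuous intensity value is sketched below in plain Python. The text does not state whether the coefficient was computed on raw values or on ranks, so the sketch shows both as an assumption: a Pearson coefficient on the raw values and the same coefficient on tie-averaged ranks (commonly called Spearman's coefficient). The function names and example values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_correlation(x, y):
    """Pearson coefficient computed on ranks (Spearman's coefficient)."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: visual scores (0 to 3+) and digital intensities
visual = [0, 1, 1, 2, 3, 3]
digital = [0.05, 0.20, 0.28, 0.45, 0.81, 0.77]
r_raw = pearson(visual, digital)
r_rank = rank_correlation(visual, digital)
```

Because the four-step visual score produces many ties, the rank-based variant relies on averaged ranks, which is why `average_ranks` is written out explicitly.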

Table 3 Level of agreement between visual scoring and computer-assisted evaluation

Comparison of digital image analysis with survival analysis

We also performed a comparative survival analysis. Based on the visual evaluation, β-catenin overexpression (score 3+) was found in 44.6 % of the patients, and no or weak expression (score 0, 1+, or 2+) in 55.4 % (Table 4). There were no significant correlations between β-catenin expression and disease-free survival (P = 0.4511) or overall survival times (P = 0.1868) (Fig. 3).

Table 4 Visual scoring of five proteins using a four-step scoring system (0, 1+, 2+, 3+)
Fig. 3 Survival analysis example: β-catenin expression in esophageal adenocarcinomas: a disease-free survival (DFS) using visual evaluation, b disease-free survival using image analysis, c overall survival (OS) using visual evaluation, d overall survival using image analysis

The results for HER2 and pEGFR were similar to the results for β-catenin. Using visual assessment, there were no statistically significant prognostic effects on disease-free survival (HER2: P = 0.0560; pEGFR: P = 0.2848) or overall survival times (HER2: P = 0.0640; pEGFR: P = 0.2982) (Table 5). Strong HER2 expression (score 3+) was detected in 9.4 % of the patients, whereas most (90.6 %) had moderate or no expression of HER2 (score 0, 1+, or 2+). Overexpression of membrane-bound pEGFR (score 1+, 2+, or 3+) was found in 29.4 % of the patients, and no signal (score 0) in the remaining 70.6 % (Table 4).

Table 5 Comparison between visual and computer-supported assessment of univariate survival analysis of five proteins in patients with esophageal adenocarcinoma

Although there were no statistically significant associations between patient survival and β-catenin, pEGFR, or HER2 (P > 0.05) based on visual scoring, there were significant associations between the biomarkers analyzed by digital image analysis and disease-free survival (β-catenin: P = 0.0125; HER2: P = 0.0096; pEGFR: P = 0.0299) and overall survival (β-catenin: P = 0.0063; HER2: P = 0.0022; pEGFR: P = 0.0135) (Table 5). β-Catenin and pEGFR overexpression were significantly associated with good prognosis, and HER2 overexpression was associated with a poor clinical outcome.

Based on visual evaluation, overexpression of E-cadherin was significantly positively associated with overall survival (P = 0.0145), but the association with disease-free survival was not statistically significant (P = 0.0526). An overexpression (score 3+) was present in 40.4 % of the patients, and a weak or no expression (score 0, 1+, or 2+) was present in the remaining 59.6 % of the patients. By using digital image analysis for the evaluation of E-cadherin, we were able to enhance the visually assessed level of significance for correlation with both overall survival (P = 0.0004) and disease-free survival (P = 0.0014), resulting in statistically significant good prognosis for high E-cadherin expression.

The Kaplan–Meier analysis of EGFR revealed that overexpression (score 1+, 2+, or 3+) was characterized by a poorer prognosis in terms of overall survival (P = 0.0091) and disease-free survival (P = 0.0045). We observed a high expression of EGFR (score 1+, 2+, or 3+) in 22.1 % of the patients and no expression (score 0) in 77.9 %. For EGFR, the level of significance obtained with visual evaluation was greatly increased for both disease-free survival (P < 0.0001) and overall survival (P < 0.0001) by using digital image analysis.

For each of the five membrane biomarkers, the results indicated that compared with visual scoring, biomarkers evaluated by digital image analysis had a greater association with both disease-free and overall survival. Based on digital image analysis, we obtained statistically significant results for the survival analyses of all of the five biomarkers. The results indicated that there was a positive association with clinical outcome for β-catenin, E-cadherin, and pEGFR. Overexpression of HER2 or EGFR was associated with poor prognosis.

Combinatorial survival analysis

For all markers, we performed combinatorial survival analyses of two markers each, based on computer-assisted evaluation. For this purpose, patients with good prognosis for both markers and patients with poor prognosis for both markers were each assigned to one group for survival analysis. The results indicated that a combinatorial evaluation of HER2 and E-cadherin is more accurate in predicting patient overall survival (P = 0.0002) and disease-free outcomes (P = 0.0003) compared to single marker assessment (Table 5). For the remaining marker combinations, no increased significance was observed.
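The grouping used for a two-marker analysis can be sketched as follows in plain Python. The cutoffs, patient IDs, and marker values are hypothetical, and the treatment of patients with discordant marker results (collected separately and left out of the two-group comparison) is an assumption, since the text only describes the two concordant groups. The direction of each marker follows the single-marker results reported above: high HER2 is treated as unfavorable and high E-cadherin as favorable.

```python
def concordant_groups(marker_a, marker_b, cutoff_a, cutoff_b,
                      high_a_is_good, high_b_is_good):
    """Split patients by whether both markers agree on prognosis.

    marker_a, marker_b : dicts mapping patient ID to a continuous value
    cutoff_a, cutoff_b : values above the cutoff count as high expression
    high_*_is_good     : True if high expression indicates good prognosis
    """
    groups = {"both_good": [], "both_poor": [], "discordant": []}
    for patient_id in sorted(marker_a.keys() & marker_b.keys()):
        good_a = (marker_a[patient_id] > cutoff_a) == high_a_is_good
        good_b = (marker_b[patient_id] > cutoff_b) == high_b_is_good
        if good_a and good_b:
            groups["both_good"].append(patient_id)
        elif not good_a and not good_b:
            groups["both_poor"].append(patient_id)
        else:
            groups["discordant"].append(patient_id)
    return groups


# Hypothetical values in relative units, not data from the study
her2 = {"P1": 0.2, "P2": 0.9, "P3": 0.1, "P4": 0.8}
ecad = {"P1": 0.7, "P2": 0.2, "P3": 0.3, "P4": 0.9}
groups = concordant_groups(her2, ecad, 0.5, 0.5,
                           high_a_is_good=False, high_b_is_good=True)
```

The two concordant groups would then be compared with the same Kaplan–Meier and log-rank procedure as in the single-marker analyses.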

Discussion

There is a considerable need for standardized evaluation of biomarker expression in clinical pathology and in tissue-based research. Currently, the grading and assessment of biomarkers are usually performed using manual scoring systems. Visual evaluation of IHC remains a rather subjective process characterized by significant intra- and interobserver variability and reduced reproducibility, which may lead to inaccurate diagnosis and treatment decisions (Dolled-Filhart et al. 2010; Gudlaugsson et al. 2012; Laurinavicius et al. 2012; Adams et al. 1999). In particular, accurate determination of HER2 status by reducing the number of false positives or false negatives is critical to reliably identify patients who might benefit from trastuzumab treatment. Improvements in this area will reduce human suffering and healthcare costs (Dobson et al. 2010).

Digital image assessment has a huge potential to overcome the limitations of manual scoring systems. Numerous studies in the last two decades have compared visual and computer-assisted scoring, but most of them are limited to comparing the level of agreement between both methods (Braun et al. 2013; Lloyd et al. 2010; Rizzardi et al. 2012; Tuominen et al. 2012; Nassar et al. 2011; Messersmith et al. 2005). Other studies compared visual scoring and IHC evaluations by digital image analysis using survival analysis, but without confirming an analytical advantage of image analysis (Mohammed et al. 2012a, c; Ong et al. 2010; Turashvili et al. 2009). In contrast, in our approach, we investigated the association between IHC evaluation and patient outcome, and we found that biomarker evaluation using digital image analysis provided prognostic information beyond that attainable with conventional visual methods. Using image analysis, we found higher associations for biomarker expression with clinical outcome in each case (HER2, EGFR, pEGFR, β-catenin, E-cadherin).

One reason for this is that IHC measurements using digital image analysis are more precise, particularly for moderate staining intensity or heterogeneous IHC expression within the tissue. The intensity of immunoreactivity in tissue is a crucial criterion for scoring. However, visual recognition is subjective, even for the trained pathologist, and the evaluation is performed in an approximate rather than an accurate manner (Ong et al. 2010; Dobson et al. 2010; Rimm 2006). Visual evaluation is based on distinguishing groups of similar quality, such as color intensity or size, which is prone to misclassification when these qualities change slightly. Interobserver variability for HER2 classification tends to be higher when discriminating 1+ and 2+ or 2+ and 3+ cases, while negative cases are straightforward to score (Turashvili et al. 2009). Additionally, the visual approach can be biased, resulting in inaccurate scoring values.

Whereas visual assessment is restricted to a pre-defined scoring system consisting of discrete values (0, 1+, 2+, 3+) for statistical analysis, image analysis produces continuous data, which considerably improves the precision of the cutoff value used for survival analysis. This difference resulted in a higher prognostic significance. Continuous variable data make statistical evaluation much more flexible and enable identification of IHC cut points of prognostic relevance that were either undetected or had a lower statistical significance when visual scores were used (Braun et al. 2013; Rizzardi et al. 2012).
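The advantage of continuous data for cutoff selection can be made concrete with a small sketch in plain Python: every observed intensity is tried as a cut point, the patients are split into a high and a low group, and the split with the largest log-rank χ² is kept. The exhaustive scan, the minimum group size, the function names, and the example data are assumptions for illustration; the optimization procedure actually used in the study is not described in the text.

```python
import math


def logrank_chi2(times, events, in_high):
    """Log-rank chi-square (1 df) comparing the high group with the rest."""
    obs_minus_exp = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for x in times if x >= t)                      # at risk
        n_high = sum(1 for x, h in zip(times, in_high) if x >= t and h)
        d = sum(1 for x, e in zip(times, events) if x == t and e)
        d_high = sum(1 for x, e, h in zip(times, events, in_high)
                     if x == t and e and h)
        obs_minus_exp += d_high - d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


def p_value(chi2):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def best_cutoff(values, times, events, min_group=2):
    """Try each observed value as cut point; return (cutoff, chi-square)."""
    best = None
    for cut in sorted(set(values)):
        in_high = [v > cut for v in values]
        n_high = sum(in_high)
        if min(n_high, len(values) - n_high) < min_group:
            continue  # skip splits that leave one group too small
        chi2 = logrank_chi2(times, events, in_high)
        if best is None or chi2 > best[1]:
            best = (cut, chi2)
    return best


# Hypothetical intensities and follow-up (months), not data from the study
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
times = [4, 5, 6, 1, 2, 3]
events = [1, 1, 1, 1, 1, 1]
cutoff, chi2 = best_cutoff(values, times, events)
p = p_value(chi2)
```

A four-step visual score offers at most three possible cut points, whereas the continuous values in this sketch offer one candidate per distinct measurement.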

Additionally, benefits of a digital image analysis-based approach include objectivity, reproducibility, and reliability of scoring results, which allows for a more robust and standardized IHC evaluation (Minot et al. 2012; Vayrynen et al. 2012; Skaland et al. 2008a).

In recent years, the capabilities of advanced, computer-assisted image analysis have improved significantly and include complex algorithms that are used to interpret images for IHC quantification (Webster and Dunstan 2014; Kayser and Kayser 2013; Foran et al. 2011; Kayser et al. 2009; Rojo et al. 2009). The results of our study confirm that digital image analysis allows investigators to assess prognostic value with greater sensitivity and to increase the quality of IHC quantification. Using digital image analysis, we were able to identify the biological effects within the tissue much more accurately by obtaining important data not available when conventional visual approaches are used.

Rimm and co-workers (Welsh et al. 2011; Harigopal et al. 2010; Camp et al. 2003) also found that digital image analysis appears to be more sensitive than manual IHC analysis. In comparing HER2 IHC with fluorescence in situ hybridization (FISH), recent studies report an increased concordance with FISH, and a decrease in the number of cases interpreted as equivocal (2+) when image analysis is used (Minot et al. 2012; Dobson et al. 2010; Skaland et al. 2008b).

In practice, quality control methods are still required during the data analysis and interpretation. Digital image analysis may provide more detailed information and improve quality, but digital image analysis algorithms cannot reliably differentiate between benign and malignant lesions, or recognize artifacts, such as tissue folds, with the same precision as an experienced pathologist.

Based on the findings of this study, digital image analysis has great potential as a diagnostic support tool and may significantly improve the sensitivity and standardization of IHC evaluation.