Deep Learning based computational histopathology has demonstrated the ability to enhance patient healthcare quality by automating the time-consuming and expensive task of analyzing high-resolution WSIs [4, 9]. The analysis involves identifying tissue- or cellular-level morphological features highlighted by staining dyes such as Hematoxylin and Eosin (H&E). The stain color distribution of a WSI depends upon factors including the tissue preparation process, dye manufacturer, and scanning equipment. As a result, there is high variability in the appearance of histopathology images, as seen in Fig. 1. Stain variations may appear between WSIs scanned at different centers as well as within the same center [2]. Often the model training data is obtained from a single lab, but the model is deployed across multiple other labs. This domain shift hampers the performance of deep learning models on out-of-distribution samples [11, 22, 29], as seen in Fig. 2a.

Fig. 1. Variation in stain color distribution between data from different labs. The graph shows the mean and standard deviation of pixel intensity for all image patches in the respective dataset, in the HSV color space. Best viewed in color.

Related Work: Existing solutions reduce the effect of variation in stain distribution by normalizing stain color or by improving the model’s generalization on the test set. Traditional stain normalization approaches [17, 26] normalize the target domain images by matching their color distribution to a single reference image from the source domain. Recently, generative deep CNNs [3, 8, 13, 19, 20] have been trained to perform image-to-image domain transfer, learning the color distribution from the entire set of source domain images. However, the regenerated images may display undesired artifacts [3, 7], leading to misdiagnosis. Furthermore, stain normalization significantly adds to the computation cost as each image needs to be pre-processed. Model generalization to stain variations can also be improved by learning stain agnostic features using Domain Adversarial training [12, 15] or by using color augmentation to simulate variations in stain [6, 10, 25, 27, 28]. These methods provide better generalization without the need for a dedicated normalization network during inference.

Many state-of-the-art unsupervised approaches [3, 6, 16, 27, 28, 31] rely on unlabeled images from the test set to generalize the model to a target distribution. The methods in [6, 27, 28] derive augmentations from the stain of test lab samples and apply them to the training data, [3, 20] require images from the target domain to learn a stain normalization model, whereas [16, 31] use semi-supervised models for unsupervised domain adaptation, learning from unlabeled target domain images. Similarly, [31] showed that semi-supervised learning methods like [21, 23, 30] can learn from a consistency loss between real and noisy versions of the unlabeled target domain images. However, computational histopathology models need to be invariant to unseen intra-lab as well as inter-lab stain variations for application in patient care. In this work, we propose a novel strategy that learns stain invariant features without requiring any knowledge of the test data distribution. The goal of our approach is to learn a feature space with high overlap between the distributions of the training set and unseen test sets with stain variations.

Fig. 2. Scatter plots of 2D features obtained using UMAP [18] for validation data from the RUMC lab and test set data from the CWZ, RST, UMCU and LPON labs for the Camelyon [1] dataset. Symbols \(+\) and \(\bullet \) represent classes 0 and 1 respectively. For the test set data, the Vanilla model misclassifies samples, represented by \(+\) overlapping with \(\bullet \) and vice versa. The Stain-AgLr model classifies a significantly smaller number of samples incorrectly, with only a few \(\bullet \) overlapping with \(+\). This shows that the feature space produced by the proposed Stain-AgLr model has a high overlap between the validation and test distributions.

Our Contributions: Stain invariance is induced with a two-pronged strategy, as shown in Fig. 3. We use a stain-altered version of an image to mimic test samples from a different lab, and impose consistency between the predictions for the raw image and its stain-altered version by penalizing their relative entropy. The consistency regularization loss enforces a similar feature space representation for differently stained versions of the same image, that is, the model learns that the difference in stain color has no bearing on the prediction task. In parallel, a generator network is tasked to regenerate the original stain from feature space representations of the stain-altered image, that is, to perform stain normalization as an auxiliary task. We show that the two tasks are complementary and help the model learn features invariant to stain variations. During inference, only the underlying model for tissue analysis is used, without adding any computational overhead.

We compare the proposed method with state-of-the-art stain normalization as well as stain augmentation methods. We show that Stain-AgLr achieves better generalization on unseen stain variations, based on evaluation on two histopathology datasets. The increased stain invariance is a result of high overlap between train and test domain data in the feature space produced by Stain-AgLr. To the best of our knowledge, our work is the first to employ stain normalization as an auxiliary task rather than a preprocessing step and to show that it leads to improved generalization on unseen test data with stain variations. Furthermore, the inference time corresponds to that of only the classification network, which is significantly lower compared to stain normalization methods.

Fig. 3. Proposed approach for learning features invariant to stain variations in histopathology images. During the training phase, the model learns from three supervisory signals: stain regeneration loss between raw and color-altered images, consistency regularization loss between logits of the raw and color-augmented image, and classification loss. During inference, only the layers required for classification are used.

1 Method

We train a classification network that shares a feature extractor with a stain regeneration network. In addition to Cross-Entropy Loss, the network is supervised by two loss functions - Consistency Regularization Loss and Stain Regeneration Loss.

1.1 Model Architecture

Let \(M_{ft}\), \(M_{cls}\) and \(M_{gen}\) represent the feature extractor, the classifier, and the generator respectively. Together, the networks \(M_{ft}\) and \(M_{gen}\) constitute a stain regeneration network that learns the mapping to regenerate the stain color distribution of the training images from stain-altered images. On the other hand, the network \(M_{ft}\), in conjunction with \(M_{cls}\), classifies the input image into a set of task-specific classes. Global average pooling (GAP) and Dropout (50%) are applied to the output of \(M_{ft}\), producing a feature vector that is the input to \(M_{cls}\). During inference, only the classification network is used, with the layers particular to the stain regeneration network removed. We follow the CNN architecture provided by [10, 25]. Details are provided in the Supplementary Material.
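For illustration, the following minimal PyTorch sketch shows how a shared feature extractor can feed both a classification head and a regeneration head. The backbone below is a small placeholder and does not reproduce the architecture of [10, 25]; the layer sizes and the single up-sampling step in \(M_{gen}\) are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn as nn

class StainAgLrSketch(nn.Module):
    """Illustrative sketch: shared feature extractor (M_ft) with a classification
    head (M_cls) and a stain regeneration head (M_gen). The backbone is a
    placeholder, not the architecture of [10, 25]."""
    def __init__(self, num_classes=2, feat_ch=64):
        super().__init__()
        self.M_ft = nn.Sequential(                      # feature extractor
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, feat_ch, 3, padding=1), nn.ReLU(),
        )
        self.M_cls = nn.Sequential(                     # classification head
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # global average pooling
            nn.Dropout(0.5),                            # 50% dropout
            nn.Linear(feat_ch, num_classes),
        )
        self.M_gen = nn.Sequential(                     # stain regeneration head
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(feat_ch, 3, 3, padding=1),        # back to 3-channel RGB
        )

    def forward(self, x, regenerate=False):
        feats = self.M_ft(x)
        logits = self.M_cls(feats)
        if regenerate:                                  # training path
            return logits, self.M_gen(feats)
        return logits                                   # inference path: classifier only
```

At inference time only `M_ft` and `M_cls` are exercised, which is why the regeneration head adds no overhead once the model is deployed.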

1.2 HED Jitter

We employ HED jitter to generate stain-altered histopathology images [24, 25]. This results in samples that resemble data from different sources. As shown in Fig. 3, the model is fed with both raw and HED jittered images in a single batch. The altered image is used both to learn stain invariant features, by matching its logits to those of the original image, and to train the stain regeneration network, which regenerates the raw image. We use the HED-light [25] configuration for Stain-AgLr with default parameters, including morphological and brightness-contrast augmentations.
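As an illustration, HED jitter in the spirit of [24, 25] can be sketched with scikit-image's Hematoxylin-Eosin-DAB color deconvolution; the jitter strength `sigma` below is a placeholder and does not correspond to the exact HED-light setting.

```python
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def hed_jitter(img_rgb, sigma=0.05, rng=None):
    """Perturb an RGB patch in the HED (Hematoxylin-Eosin-DAB) space.
    img_rgb: float array in [0, 1] of shape (H, W, 3).
    sigma: jitter strength (placeholder, not the exact HED-light value)."""
    rng = rng or np.random.default_rng()
    hed = rgb2hed(img_rgb)                               # deconvolve RGB into HED stain channels
    alpha = rng.uniform(1 - sigma, 1 + sigma, size=3)    # per-stain scaling
    beta = rng.uniform(-sigma, sigma, size=3)            # per-stain shift
    hed_jittered = hed * alpha + beta                    # jitter each stain channel
    return np.clip(hed2rgb(hed_jittered), 0.0, 1.0)      # recompose and clip to [0, 1]
```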

1.3 Loss Functions

Two loss functions, Consistency Regularization and Stain Regeneration, are used to train the Stain-AgLr model, in addition to the task-specific Cross-Entropy Loss. We guide the model to produce similar predictions for a raw image and its stain-altered version using the Consistency Regularization Loss (\(L_{Cons}\)). Specifically, we minimize the divergence \(D(P_\theta (y|x) || P_\theta (y|x, \epsilon )) \), where D is the Kullback-Leibler (KL) divergence, y is the ground truth label corresponding to input x, and \(\epsilon \) represents stain color noise. This forces the model to be insensitive to stain color noise.

The features of the altered image from \(M_{ft}\) are passed to \(M_{gen}\), which regenerates the stain color to match the raw image. We use the MSE loss as the Stain Regeneration Loss (\(L_{Reg}\)) between the raw image and the regenerated image. This auxiliary task helps the model improve generalization on images with stain variations by learning shared features useful for both the classification and regeneration tasks. As a result, a combination of three loss functions is used to train the model.

$$\begin{aligned} {L} = L_{CE} + \lambda _1 L_{Reg} + \lambda _2 L_{Cons} \end{aligned}$$
(1)
$$\begin{aligned} L_{Cons} = D_{KL} \big ( M_{cls}(M_{ft}({x})) \, || \, M_{cls}(M_{ft}(\hat{x})) \big ) \end{aligned}$$
(2)
$$\begin{aligned} L_{Reg} = \frac{1}{W \cdot H}\sum _{i=1}^{W \cdot H} \big \Vert \big (M_{gen}(M_{ft}(\hat{x}))\big )_i - x_i \big \Vert ^2 \end{aligned}$$
(3)

where \(\lambda _1\) and \(\lambda _2\) are loss weights, and x and \(\hat{x}\) are the raw and stain-altered images respectively.
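A hedged PyTorch sketch of the combined objective in Eqs. (1)-(3) is given below. Applying the cross-entropy term only to the raw image and leaving the consistency target without a stop-gradient are implementation assumptions, as are the default loss weights.

```python
import torch.nn.functional as F

def stain_aglr_loss(logits_raw, logits_aug, regen_img, raw_img, labels,
                    lambda_reg=0.1, lambda_cons=10.0):
    """Sketch of Eq. (1): cross-entropy + stain regeneration + consistency."""
    l_ce = F.cross_entropy(logits_raw, labels)            # task-specific classification loss
    l_reg = F.mse_loss(regen_img, raw_img)                # Eq. (3): pixel-wise MSE to raw image
    p_raw = F.softmax(logits_raw, dim=1)                  # P(y|x)
    log_p_aug = F.log_softmax(logits_aug, dim=1)          # log P(y|x_hat)
    l_cons = F.kl_div(log_p_aug, p_raw,                   # Eq. (2): D_KL(P(y|x) || P(y|x_hat))
                      reduction="batchmean")
    return l_ce + lambda_reg * l_reg + lambda_cons * l_cons
```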

2 Experiments

2.1 Setup

We evaluate Stain-AgLr using two publicly available datasets - TUPAC Mitosis Detection and Camelyon17 Tumor Metastasis Detection. Both datasets segregate images based on the lab of origin. This allows models to be trained on data from a single lab, while unseen data samples from other labs are utilized to test the model’s robustness to stain variations.

Camelyon17 [1]. The dataset contains H&E stained WSIs of sentinel lymph nodes from five different medical centers. In our experiments, 10 WSIs are used from each center for which annotation masks are available. 95,500 patches of size 256\(\,\times \,\)256 were created at 40x magnification, of which 48,300 represent metastasis. Patches from RUMC were used for training and validation; the remaining centers (CWZ, UMCU, RST, LPON) are used as test sets.

TUPAC Mitosis Detection [5]. The dataset consists of 73 breast cancer cases from three pathology centers. The first 23 cases were obtained from a single center, whereas cases 24–48 and 48–73 were collected from two other centers. We use the binary labels provided by [5], which comprise 1,898 mitotic figure patches and 5,340 hard negative patches of size 128\(\,\times \,\)128. For training and validation, we use samples from the first 23 cases and separately report performance on the other two subsets.

Training Setup: We use an initial learning rate of 5e-3 for the TUPAC dataset and 1e-2 for the Camelyon dataset, obtained using a grid search. In the case of TUPAC, we sample an equal number of images per batch for both classes to mitigate the effect of class imbalance. All models are trained using the Adam optimizer with a batch size of 64, reducing the learning rate by a factor of 0.1 if the validation loss does not improve for 4 epochs for Camelyon and 15 epochs for TUPAC. Training is stopped when the learning rate drops to 1e-5. For each run, we select the model weights corresponding to the lowest validation loss. We found that re-weighting factors \(\lambda _1=0.1\) and \(\lambda _2=10\) in the multi-task loss (Eq. 1) gave the best performance. Geometric augmentations including random rotation in multiples of 90\(^\circ \) and horizontal and vertical flipping were employed for all models. HED-Light augmentation is used with default parameters as described in [25], including morphological and brightness-contrast augmentation. Models were trained on NVIDIA Tesla A100 GPUs using the PyTorch library.
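A minimal sketch of this schedule, assuming PyTorch's Adam optimizer and ReduceLROnPlateau scheduler, is shown below with the Camelyon settings; the model is a dummy and the validation loss is a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                   # dummy model for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # Camelyon; 5e-3 for TUPAC
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=4)         # patience=15 for TUPAC

for epoch in range(1000):
    val_loss = float(torch.rand(1))                        # placeholder for the real validation loss
    scheduler.step(val_loss)                               # reduce LR by 0.1 on plateau
    if optimizer.param_groups[0]["lr"] < 1e-5:             # stop once LR drops below 1e-5
        break
```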

2.2 Evaluation

We conduct experiments to compare our proposed method with stain normalization as well as color augmentation techniques for improving model generalizability on unseen test data. Tables 1 and 2 report AUC scores on test data obtained from multiple labs with differing stain color distributions, along with standard deviations. All experiments are repeated ten times, with different random seeds.

Table 1. Classification AUC scores on Camelyon17 Metastasis Detection dataset. The model is trained on data from RUMC lab, evaluated on test set from labs with unseen stain variations.

The Vanilla model does not use any stain normalization or stain augmentation. Vahadane [26], STST GAN [19] and DSCSI GAN [13] represent classifier performance on images normalized using the corresponding stain normalization method. Both [13, 19] do not require samples from the target domain during training of the GAN. HED-Light Aug represents a model trained with HED-Light augmentation. Lastly, Stain-AgLr represents the proposed approach.

Table 2. Classification AUC scores on TUPAC Mitosis Detection dataset. The models were trained on data from Center 1, and data from Center 2 & 3 is used as test set with unseen stain variations.

3 Discussion

Classifiers trained on data from one lab show poor performance on data from other labs. The performance degradation depends upon the deviation of the stain color distribution from the training set. All stain normalization methods improve classifier performance, thus inducing invariance to stain color changes in the downstream classification model. Both GAN-based approaches [13, 19] provide better stain normalization, learning the stain color distribution from the entire training set, unlike [26], which uses a single reference image from the training set. We observe that a classifier trained with HED-light augmentation matches or outperforms deep learning-based stain normalization approaches, as also reported by [22, 25]. This indicates that the vanilla model overfits the stain color information from a single lab, which is alleviated by the use of color augmentation.

Impact of Consistency Regularization and Stain Regeneration Loss. Stain-AgLr outperforms models trained using stain normalization algorithms as well as HED-light augmentation. The enhancement in stain invariance is contributed by both the Consistency Regularization and the Stain Regeneration Loss. Using either loss individually improves performance over the model trained with HED-Light augmentation; however, the best results are obtained by the network trained using both loss functions together. This demonstrates that the two tasks are complementary to one another for learning stain invariant features. The multi-task network learns a shared representation that is less likely to over-fit on noise in the form of stain color.

Stain invariance of Stain-AgLr can be further established by analyzing the distributions of the validation and test data in feature space using UMAP plots, visualized in Fig. 2. For the test set data, the plots show significantly better class separation produced by the proposed Stain-AgLr model as compared to the Vanilla model. Importantly, the distribution of class samples from the test set corresponds with the respective classes from the validation data. In other words, the feature space produced by the proposed Stain-AgLr model shows a high overlap between the validation and test data distributions. This verifies that the quantitative performance gain is obtained by Stain-AgLr learning stain invariant features.
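As a sketch of this analysis, 2D embeddings of the pooled feature vectors can be produced with the umap-learn package; the random feature arrays below are placeholders for the pooled \(M_{ft}\) outputs, and the plot colors samples by data split rather than by class.

```python
import numpy as np
import umap                              # umap-learn package
import matplotlib.pyplot as plt

val_feats = np.random.randn(500, 64)     # placeholder for pooled M_ft features (validation)
test_feats = np.random.randn(500, 64)    # placeholder for pooled M_ft features (test labs)

reducer = umap.UMAP(n_components=2, random_state=0)
emb = reducer.fit_transform(np.vstack([val_feats, test_feats]))   # shared 2D embedding

plt.scatter(emb[:500, 0], emb[:500, 1], marker="+", label="validation (RUMC)")
plt.scatter(emb[500:, 0], emb[500:, 1], marker=".", label="test (other labs)")
plt.legend()
plt.show()
```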

All stain normalization approaches significantly increase the base classifier’s inference time, as seen in Table 2. Although Stain-AgLr employs additional convolution layers during training, its inference time is identical to that of a vanilla classifier. Higher throughput is beneficial in reducing turnaround time in patient diagnostics, especially when processing high-resolution histopathology WSIs, as well as in reducing the computational requirements for deployment in a diagnostic laboratory setup. Thus, the proposed approach combines the best of both worlds: improved stain invariance and fast inference.

4 Conclusion

Invariance to stain color variations in histopathology images is essential for the effective deployment of computational models. We present a novel technique, Stain-AgLr, which learns stain invariant features that lead to improved performance on images from different labs. We also show that Stain-AgLr results in a high overlap between the feature space distributions of images with varying H&E staining. Unlike many state-of-the-art techniques, Stain-AgLr does not require unlabeled images from the test data, nor does it add any computational burden during inference.