1 Introduction

Staining is a step in microscopic slide preparation in which chemical stains are applied to highlight the regions of interest in a tissue/blood sample or bone marrow smear. Stain color variations in the images across datasets collected from various hospitals/centers may arise due to varying illumination conditions, stain chemicals, and staining time. Thus, a diagnostic tool trained on one center’s dataset may perform sub-optimally on another center’s dataset. To counter this problem, stain color normalization is employed. Some of the widely used stain normalization approaches are based on histogram equalization [6, 8, 18], color deconvolution [9, 11, 15, 17], and color transfer [10, 13, 14]. In histogram equalization methods, the probability density functions of the reference image and source image colors are matched for each R, G, and B channel. In [10], color transfer between the two images is used to achieve stain normalization, while in [13], the mean and standard deviation of the source and reference images are matched. However, these methods do not utilize histological information. Color deconvolution methods are based on non-negative matrix factorization (NMF) or singular value decomposition (SVD) to find the stain vectors. In [4], SVD, along with robust alignment of the reference and query images’ Cartesian frames, is utilized to counter stain color variation. A limitation of all these methods is that the performance depends on the choice of reference image and may change considerably when the reference image changes.

Of late, deep learning (DL) methods are gaining significance in this area. Some generative model-based DL methods for stain normalization are discussed in [16, 21, 22]. In [22], InfoGAN is utilized, while in [21], a variational autoencoder and a deep convolutional Gaussian mixture model are used for stain normalization. In [16], a CycleGAN-based method is used to preserve structure information while applying the color transformation. However, this method requires the reference center’s data, which may not always be available. As a remedy, self-supervised learning, which does not require ground truth (GT), can be employed [7]. It is based on first training the network on pretext tasks and later utilizing it for downstream tasks. In [19], a network is trained to carry out stain normalization by learning to map color (stain) augmented images to the corresponding original images. This can be considered self-supervised learning, wherein the pretext task is to transform the color-augmented images back to the original images, and the downstream task is to transform the color of test images to that of the training data.

This paper utilizes this approach but uses two coupled subnetworks (N1 and N2) instead of a single network and learns a dual transformation (two pretext tasks). In addition, classification heads are added to the encoders of both subnetworks [5, 12]. The first subnetwork (N1) learns an identity transformation, while the second (N2) learns to perform stain normalization. N1 helps N2 learn the context-aware stain normalization task, while N2 helps N1 learn stain-invariant features for the reconstruction task. Thus, N1 and N2 assist each other, leading to improved stain-invariant performance on the subsequent classification task. We also experimented with segmentation as a downstream task. Since the proposed architecture can be utilized for stain normalization, classification, and segmentation, we name it an all-in-one network (AION). Extensive experiments are presented on four datasets spanning the three tasks.

2 Methods

Consider a source image dataset and a corresponding augmented dataset with color variations. The aim is to perform stain normalization such that the stain color profile of the augmented dataset is as close as possible to that of the source dataset. If trained successfully, the model can map one center’s images to those of another center with a matching stain color profile. To accomplish this, we modify the architecture in [19] (here named AION\(^{- -}\)) to arrive at AION, presented in Fig. 1.

Fig. 1. AION: An architecture with two coupled networks and classification heads for learning two transformations: one identity and another for stain normalization. AION can also be used for stain-invariant classification and segmentation.

Normalization by the AION Architecture: The AION architecture can be seen as a self-supervised learning architecture consisting of two parallel U-Net-type AION\(^{- -}\) subnetworks coupled through cross-connections running from the encoder units of one to the decoder units of the other. The first subnetwork, N1, is trained to learn the identity transformation on the source dataset. In contrast, the second subnetwork, N2, is supposed to map the augmented image dataset’s stain color profile to that of the source dataset. Each subnetwork consists of an input layer, four encoder units (EUs), four decoder units (DUs), and one output layer. The input layer and encoder units constitute the encoder, while the decoder units and output layer constitute the decoder. The input layer consists of 32 filters, and each successive EU consists of twice as many filters as the preceding layer/unit. Hence, the first EU consists of 64 filters and the fourth EU of 512 filters. The size of each filter is \(3 \times 3\). Each convolutional operation in the encoder is performed with a stride of two and a padding of one, which halves the spatial size after each convolutional operation. Each EU consists of two parallel convolutional layers and provides the coupling mechanism between N1 and N2. These parallel layers receive the same input from the preceding EU. However, the output of one goes to the succeeding EU of the same subnetwork, and the output of the other goes to the corresponding decoder unit of the other subnetwork. Similar cross-connections are present in both subnetworks (Fig. 1).
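
The coupling mechanism can be sketched in PyTorch as follows. This is a minimal, illustrative sketch under the stated assumptions (\(3 \times 3\) convolutions with stride two and padding one, followed by BatchNorm and Leaky ReLU); the class and attribute names are ours, not taken from the authors' code.

```python
import torch
import torch.nn as nn

class EncoderUnit(nn.Module):
    """One EU: two parallel stride-2 convolution branches that share the same input."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(inplace=True),
            )
        self.to_same_net = branch()   # fed to the next EU of the same subnetwork
        self.to_other_net = branch()  # routed to the corresponding DU of the other subnetwork

    def forward(self, x):
        return self.to_same_net(x), self.to_other_net(x)


# Example: the first EU maps 32-channel features to 64 channels and halves the spatial size.
eu1 = EncoderUnit(32, 64)
same, other = eu1(torch.randn(1, 32, 128, 128))  # both outputs: (1, 64, 64, 64)
```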

Each DU of N1 concatenates the input from the preceding layer/unit, the output from the corresponding EU of N1, and the output from the corresponding EU of N2. DUs also upsample by a factor of two using nearest-neighbor interpolation to match the spatial size for concatenation. The concatenated output is given to the convolutional layers within the DU, and the final output is forwarded to the next DU. The decoder of N2 has the same structure. Each DU contains half the number of filters of the preceding DU, with the first DU containing 256 filters and the fourth containing 32 filters. Finally, the output layer provides a three-channel output. Each convolutional operation in the network is followed by batch normalization (BatchNorm) and Leaky ReLU, except for the output layer, which has tanh activation and no BatchNorm.
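
A corresponding sketch of one DU is given below. It assumes the three inputs described above (the preceding output plus the two cross-connected EU outputs) are concatenated after nearest-neighbor upsampling; the channel counts in the example are illustrative only.

```python
import torch
import torch.nn as nn

class DecoderUnit(nn.Module):
    """One DU: upsample the preceding output, concatenate the own and cross-connected
    EU outputs, and apply a convolution block."""
    def __init__(self, prev_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")  # nearest-neighbor upsampling by two
        self.conv = nn.Sequential(
            nn.Conv2d(prev_ch + 2 * skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x_prev, skip_own, skip_other):
        x = self.up(x_prev)                              # match the spatial size of the EU outputs
        x = torch.cat([x, skip_own, skip_other], dim=1)  # own-subnetwork and cross-connected features
        return self.conv(x)


# Example: a DU receiving 512-channel features and two 256-channel EU outputs, producing 256 channels.
du1 = DecoderUnit(512, 256, 256)
out = du1(torch.randn(1, 512, 8, 8), torch.randn(1, 256, 16, 16), torch.randn(1, 256, 16, 16))
```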

Classification and Segmentation: To introduce the classification capability, we add two classification heads (CHs), one in each subnetwork on top of the last EU. Each CH consists of a single convolutional layer with 512 filters followed by BatchNorm, ReLU activation, global average pooling, and a classification layer. The CH of N1 predicts the class of images of the source dataset, while the CH of N2 predicts the class of the color-augmented dataset. AION can also be used for segmentation. In this case, both N1 and N2 are fed with the source dataset, and the classification heads are not trained.
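
A minimal sketch of such a head follows; the kernel size and the number of classes are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Conv(512) -> BatchNorm -> ReLU -> global average pooling -> linear classifier."""
    def __init__(self, in_ch=512, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # global average pooling
            nn.Flatten(),
        )
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        # x: feature map from the last EU, e.g. (N, 512, H, W)
        return self.classifier(self.features(x))


# Example: class scores for a batch of bottleneck feature maps.
head = ClassificationHead()
scores = head(torch.randn(4, 512, 8, 8))  # shape (4, 2)
```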

Training Methodology: To train AION for stain normalization, an image \(I_{s}\) from the source dataset is given as input to N1. Another image \(I_{u,aug}\), obtained through some transformation \(\phi (\cdot )\) applied to an image \(I_u\) from the source dataset, is given as input to N2, where the indices u and s may or may not be equal. N1 aims to reconstruct its input image, while N2 attempts to match \(I_{u}\). If we denote the output of N1 as \(\bar{I}_{s}\) and that of N2 as \(\bar{I}_{u}\), the loss function is given as:

$$\begin{aligned} \mathcal {L}_1=MSE(\bar{I}_{s},I_{s})+MSE(\bar{I}_{u},I_{u}), \end{aligned}$$
(1)

where \(MSE(\cdot )\) is the mean square error. Hence, N1 learns an identity transformation, and N2 learns \(\phi ^{-1}\), assuming such an inverse exists. Once trained, N2 should transform the stain (color) of its input data to that of the data input to N1. The classification heads are also trained simultaneously to predict the labels of the input images. Hence, the complete loss function is given as:

$$\begin{aligned} \mathcal {L}=MSE(\bar{I}_{s},I_{s})+MSE(\bar{I}_{u},I_{u})+NLL(y_1,\hat{y}_1)+NLL(y_2,\hat{y}_2), \end{aligned}$$
(2)

where NLL is the negative log-likelihood, y is the true label, and \(\hat{y}\) is the label predicted by the classification head. The subscripts 1 and 2 denote the classification heads of Network-1 and Network-2, respectively. It should be noted that classification Head-1 is trained on the source image data, while classification Head-2 is trained on the augmented data.
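
The complete loss of Eq. (2) can be computed as sketched below; the heads are assumed to output log-probabilities so that the NLL loss applies directly (cross-entropy on raw logits would be equivalent), and all tensors are dummy placeholders for the actual network outputs and targets.

```python
import torch
import torch.nn.functional as F

def aion_loss(rec_s, I_s, rec_u, I_u, logp1, y1, logp2, y2):
    """Complete loss of Eq. (2): two reconstruction MSE terms plus two NLL terms."""
    reconstruction = F.mse_loss(rec_s, I_s) + F.mse_loss(rec_u, I_u)  # Eq. (1)
    classification = F.nll_loss(logp1, y1) + F.nll_loss(logp2, y2)    # classification heads
    return reconstruction + classification


# Example with dummy tensors (batch of 4, two classes, 128x128 RGB patches).
rec_s, I_s = torch.rand(4, 3, 128, 128), torch.rand(4, 3, 128, 128)
rec_u, I_u = torch.rand(4, 3, 128, 128), torch.rand(4, 3, 128, 128)
logp1 = torch.log_softmax(torch.randn(4, 2), dim=1)
logp2 = torch.log_softmax(torch.randn(4, 2), dim=1)
y1, y2 = torch.randint(0, 2, (4,)), torch.randint(0, 2, (4,))
loss = aion_loss(rec_s, I_s, rec_u, I_u, logp1, y1, logp2, y2)
```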

For segmentation, the tanh activations at the output layers are replaced with sigmoid activations. The input to both networks is the same, and the classification heads are not trained. Also, the original (not augmented) data is used for training. The respective segmentation mask is predicted by each network, and both networks are trained simultaneously using the binary cross-entropy loss.
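
A sketch of this segmentation objective is given below, assuming standard binary cross-entropy on the sigmoid outputs of both subnetworks; single-channel masks are used purely for illustration.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(mask_n1, mask_n2, gt_mask):
    """Binary cross-entropy on the sigmoid outputs of both subnetworks."""
    return F.binary_cross_entropy(mask_n1, gt_mask) + F.binary_cross_entropy(mask_n2, gt_mask)


# Example with dummy single-channel masks (batch of 4, 128x128 patches).
mask_n1, mask_n2 = torch.rand(4, 1, 128, 128), torch.rand(4, 1, 128, 128)
gt_mask = (torch.rand(4, 1, 128, 128) > 0.5).float()
loss = segmentation_loss(mask_n1, mask_n2, gt_mask)
```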

3 Results

In this work, we use four publicly available datasets. The Camelyon17 challenge dataset [3] is used for stain normalization, PatchCamelyon (PCam) [20] for classification, and Data Science Bowl (DSB) [1] and CVC-ClinicDB (CVC) [2] for segmentation. Camelyon17 consists of hematoxylin and eosin (H&E) stained whole slide images (WSIs) of lymph node sections collected from five centers labeled \(C_0\) to \(C_4\). For training and testing, patches of size \(128\times 128\) are extracted from level 0 of the WSIs. PCam contains \(96\times 96\) patches, each with a binary label indicating the presence or absence of metastatic tissue. There is stain variation among the patches, which makes the classification task challenging. DSB is a cell segmentation dataset with images of varying sizes, from which we extract \(128\times 128\) patches for training and testing. CVC is a polyp segmentation dataset with 612 images of size \(384 \times 288\) pixels. For training with CVC, we augment the training set using random rotations. The distribution of the patches/images for all the datasets is provided in Table 1, and sample images are shown in Fig. 2.

Table 1. Description of the four datasets. Highlighted cells for Camelyon17 represent the sum of the train and validation sets of the respective columns, as the architecture trained on one center’s data is tested on another center’s entire data.
Fig. 2. Sample images from the datasets. (Set-1, left to right) Images from C0, C1, C2, C3, and C4 of Camelyon17. (Set-2) Augmented images from C3 obtained with transformation \(\phi (\cdot )\). (Set-3) PCam. (Set-4) DSB. (Set-5) CVC.

Stain Normalization: For Camelyon17, there are variations in the colors of the stained images across the five centers (Fig. 2). Intuitively, a classifier trained on one center’s data may perform unsatisfactorily on the remaining centers’ data. To evaluate stain normalization, we train a classifier on a particular center’s data and test it on the remaining centers’ data. The stain normalization quality can then be assessed through the improvement in classification performance over the unnormalized data. We begin our analysis with C3 as the training data. For comparison, we chose Macenko [9] and Reinhard [13] as non-DL methods, and StainGAN [16] and AION\(^{{-}{-}}\) [19] as DL methods. StainGAN [16] is trained using C3 and C0 because it requires two training datasets. Results are also reported after adding a classification head to AION\(^{{-}{-}}\); we name this architecture AION\(^{{-}{-}}+\)H. Both AION and AION\(^{{-}{-}}+\)H are trained with the same training methodology as discussed in Sect. 2. To obtain the augmented images, \(\phi (\cdot )\) consists of saturation jitter in the range [0, 2.5] and hue jitter in the range [-0.5, 0.5], applied with a probability of 0.95. Sample augmented images are shown in Fig. 2. Some results are shown with an external classifier (EC) that consists of 10 convolutional layers with batch normalization and Leaky ReLU activations, followed by a classification layer.
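
With these ranges, \(\phi (\cdot )\) could be expressed as below using torchvision's ColorJitter; this is an assumption for illustration, as the authors' exact implementation of the augmentation is not given here.

```python
import torchvision.transforms as T

# Sketch of the color augmentation phi(.): saturation jitter in [0, 2.5],
# hue jitter in [-0.5, 0.5], applied with probability 0.95.
phi = T.RandomApply(
    [T.ColorJitter(saturation=(0.0, 2.5), hue=(-0.5, 0.5))],
    p=0.95,
)

# Usage (illustrative): I_u_aug = phi(I_u), where I_u is a source-image patch.
```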

All the models are implemented using PyTorch 1.1.0 and trained on an RTX 2080 GPU. Except for the classifiers, all the models are trained using the Adam optimizer for 80 epochs with a batch size of 32 and an initial learning rate of 0.001, which is reduced to one-tenth of its current value if there is no change in the validation performance for seven epochs. For the classifiers, an SGD optimizer and a batch size of 64 are used for 150 epochs, with the initial learning rate of 0.001 reduced to one-tenth at the 80th, 120th, and 140th epochs. The following results are reported with AION\(^{{-}{-}}+\)H and AION:
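
These schedules map to standard PyTorch optimizers and schedulers as sketched below (see the result list that follows); `aion` and `classifier` are placeholder modules standing in for the actual networks.

```python
import torch
import torch.nn as nn

aion = nn.Conv2d(3, 3, 3)        # placeholder for the AION model
classifier = nn.Linear(512, 2)   # placeholder for the external classifier

# Stain-normalization models: Adam, 80 epochs, batch size 32, LR 0.001,
# reduced to one-tenth when validation performance plateaus for 7 epochs.
opt = torch.optim.Adam(aion.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.1, patience=7)

# Classifiers: SGD, 150 epochs, batch size 64, LR 0.001,
# reduced to one-tenth at the 80th, 120th, and 140th epochs.
clf_opt = torch.optim.SGD(classifier.parameters(), lr=1e-3)
clf_sched = torch.optim.lr_scheduler.MultiStepLR(clf_opt, milestones=[80, 120, 140], gamma=0.1)
```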

  1. AION\(^{{-}{-}}+\)H (H): Results from the classification head added to AION\(^{{-}{-}}\).
  2. AION (H1): Results with N1’s classification head of AION.
  3. AION (H2): Results with N2’s classification head of AION.
  4. AION\(^{{-}{-}}+\)H (EC): Results with an external classifier trained on a center and tested on normalized images obtained from AION\(^{{-}{-}}+\)H.
  5. AION (N2+EC): Results with an external classifier trained on a center and tested on normalized images obtained from N2 of AION.

Results with C3 as training data are shown in Table 2 in terms of balanced accuracy (BAC), area under the curve (AUC), and weighted F1-score (WF1). The mean performance of the methods requiring reference images is inferior to that of the other methods, showing their dependence on the choice of reference image. It can be seen that stain normalization leads to improved classification performance over the original images for all the centers. For C0, AION (N2+EC) gives the top performance by a significant margin in terms of BAC and WF1. For C1 and C4, AION\(^{{-}{-}}+\)H (EC) and AION (N2+EC) have approximately similar performance, and for C2, AION\(^{{-}{-}}\) gives the best results. Apart from C2, both AION\(^{{-}{-}}+\)H (EC) and AION (N2+EC) perform better than AION\(^{{-}{-}}\), which shows the contribution of the components added to AION\(^{{-}{-}}\). Another significant observation is that AION (H1) performs better than classification on the original images. This shows that AION (H1) has also become stain-invariant, even though it is trained with the unnormalized images. This capability may have been introduced by the coupling in AION. AION\(^{{-}{-}}+\)H (H) and AION (H2) are trained with augmented images and have seen varying stain colors; hence, the performance improvement with these architectures is expected. Table 3 shows the inter-center train-test results in terms of AUC. For all the centers, there is a performance enhancement after stain normalization. Also, for most cases, results with AION are better than those with AION\(^{{-}{-}}+\)H, which in turn are better than those with AION\(^{{-}{-}}\).

Table 2. Results with C3 as the training data and the other centers as the test data. The shaded cells represent mean results over five different reference images.
Table 3. AUC with inter-center training-testing

Classification: To highlight AION’s classification capability, we use the binary-class PCam dataset, which exhibits inter-image stain variations. We also compare the results with AION\(^{{-}{-}}+\)H to show the proposed architecture’s contribution. Results are summarized in Table 4.I. A methodology similar to that of stain normalization is used for training with PCam. First, we trained EC on the original PCam training set. This classifier provided an AUC of 0.8448 on the original PCam test set. AION\(^{{-}{-}}+\)H (H), AION (H1), and AION (H2) provide gains of 4%, 4.61%, and 3.66%, respectively. Even though these classifiers are shallower (6 layers) than EC (10 layers), they provide a significant gain, which could be larger with deeper heads. Hence, the proposed architecture can produce stain-invariant classifiers as a byproduct. To explore further, we stain-normalized PCam using AION\(^{{-}{-}}+\)H and AION trained on C3 of Camelyon17. Training and testing with the resultant dataset provided AUCs of 0.8948 and 0.9105 with AION\(^{{-}{-}}+\)H and AION (N2), respectively, which again demonstrates AION’s usefulness. We also replaced the custom classifier with ResNet-34 and DenseNet-161. The best AUC is achieved with AION (N2) for both classifiers: 0.9356 with ResNet-34 and 0.9275 with DenseNet-161.

Table 4. (I) PCam Classification results, (II) CVC and DSB Segmentation results

Segmentation: In one set of experiments, we consider segmentation as the downstream task and initialize AION\(^{{-}{-}}+\)H and AION with the weights of the networks trained on C3 (C3-pretrained); in the other, they are initialized randomly. Also, since the classification heads are not trained, AION\(^{{-}{-}}\) and AION\(^{{-}{-}}+\)H are identical for segmentation. Both AION\(^{{-}{-}}+\)H and AION give inferior performance with the C3-pretrained networks compared to random initialization (Table 4.II). This shows that stain-normalization-specific features are, perhaps, not helpful for this segmentation task. Also, the performance difference between the two initializations is larger for CVC than for DSB, which is due to the similarity in imaging modality between DSB and C3. Further, AION performs better than AION\(^{{-}{-}}+\)H on both datasets. Qualitative results are presented in Fig. 3.

Fig. 3. Sample segmentation masks from AION\(^{{-}{-}}+\)H and AION (H2) on CVC.

4 Conclusion

In this work, we proposed a novel self-supervision-based, dual-transformation, coupled-network architecture for stain normalization. The architecture is also equipped with classification heads that achieve stain-invariant classification, and its utility is further shown for the downstream task of segmentation. Thus, the architecture performs well on all three applications: stain normalization, classification, and segmentation.