
1 Introduction

Oropharyngeal cancer (OPC) is the sixth most common cancer in the world, accounting for 10–15% of cancers of the head and neck [1]. Of the 5000 new cases of OPC diagnosed each year in the United States, 85–90% originate from malignancies of the epithelium [2]. Although OPC is predominantly diagnosed in individuals over 45 years old, studies in Western Europe and the United States have shown an increasing incidence in people under 45 over the past 20–30 years [3]. Precise assessment of patients with OPC therefore helps to develop individualized treatment. Pathological examination is the gold standard for the diagnosis of OPC, and the cellular composition and spatial structure of the epithelium reflect tumor heterogeneity. Hence, accurate segmentation of cancer epithelium in OPC pathological images is significant for quantitative diagnosis and subsequent image analysis [4].

Hematoxylin-Eosin (HE) staining is a routine staining method in pathology [5]. Earlier epithelium segmentation tasks were performed by pathologists manually annotating HE-stained pathology images, but this process is time-consuming and inefficient [6]. In addition, since the epithelium of OPC often has morphological and color characteristics similar to those of other tissues (such as stroma), it is difficult to distinguish epithelium from stroma by HE staining alone [7]. Immunohistochemistry (IHC) staining for cytokeratin can specifically detect the epithelium in OPC samples based on the principle of specific antigen-antibody binding [8]. As shown in Fig. 1, the epithelium is not evident in the HE-stained image in Fig. 1-a, but is clearly stained brown in the section stained for cytokeratin in Fig. 1-b, while other tissues are stained light blue.

Fig. 1. A sample of the OPC dataset.

However, IHC staining has some limitations. Firstly, it is time-consuming, is not part of the routine clinical workflow, and incurs additional costs. Secondly, the quality of IHC staining is easily affected by staining kits and conditions, which can lead to under- or over-staining [6].

With the development of deep learning and the exponential growth of digitized medical image data, the use of convolutional neural networks (CNNs) for tissue/tumor lesion segmentation of digital pathology images has reached unprecedented popularity [9]. U-Net [10] was a milestone in the semantic segmentation of medical images and has been followed by many excellent segmentation networks [11, 12]. In recent years, many GAN-based methods have also approached this task by way of synthesis. It is worth pointing out that these methods are usually used to segment tumor lesions or necrotic regions; when it comes to segmenting epithelium regions, they show their shortcomings for the following reasons: firstly, the epithelial tissue and stroma are both stained in a similar shade of purple; secondly, in terms of morphology, the epithelium lacks obvious characteristics distinguishing it from the stroma, lymphoid tissue, etc. [13].

To address the above challenges, inspired by the manual annotation process, we propose a two-step epithelial tissue segmentation framework and a stain-style transfer network named CS-Net. The details will be described in Sect. 3. The main contributions of this paper are as follows:

  1. A new two-step framework for epithelial tissue segmentation. Compared with single-step methods that segment epithelial tissue masks directly from HE-stained images, our two-step method overcomes the challenge of "indistinct epithelial tissue information in the HE-stained image" and achieves higher accuracy. The first step of the framework generates pseudo-IHC-stained images from HE-stained images; the second step performs threshold segmentation on the synthetic IHC-stained image to obtain the binary mask.

  2. A novel stain-style transfer network and attention module. Tailored to the characteristics of pathological images, we incorporate VGG16 and CS-Gate into U-Net to form CS-Net. CS-Gate is an integration of CBAM and the attention gate, extracting features from the channel and spatial dimensions simultaneously (a fusion of channel attention, spatial attention, and the attention gate, hence the name CS-Gate). Compared with mainstream segmentation networks, CS-Net achieves higher accuracy (92.48%); compared with GAN-based networks, CS-Net generates pseudo-IHC-stained images whose colors and textures are closer to the Ground Truth, reaching an SSIM of 82.02%.

  3. Better generalization capability. The external generalization experiment shows that CS-Net has the highest generalization performance, achieving an accuracy of 85.83% on the BCSS dataset without being trained on it.

2 Related Works

2.1 Tissue Segmentation

Epithelium tissue segmentation is a key step in the diagnostic analysis of digital pathology images, and the accuracy of the segmentation is crucial to the subsequent diagnosis. In recent years, driven by the development of deep learning, convolutional neural networks (CNNs) have excelled in medical image analysis, displaying capabilities that are not inferior to manual annotation. Among them, U-Net [10] is the pioneer of using CNNs for medical image segmentation; it features an encoder and a decoder, with features at the same level connected by skip connections. However, as tasks became more complex, U-Net began to show its shortcomings: it can only extract relatively simple and obvious features, and it often predicts too many false-positive (FP) and false-negative (FN) regions when distinguishing epithelial tissue from stroma [14], since these regions are highly similar in morphology.

After U-Net, many excellent segmentation networks emerged, which can be divided into two categories. One category is based on feature engineering, and the other on feature learning. In the first category, Swin-UNet [15] integrated information on the color and cellular arrangement of tissues into texture features; SegNet [16, 32] constructed a symmetric encoding-decoding structure for capturing similarities in the overall structure and appearance of stained sections; Van [17] proposed an improved DCAN that integrates the original DCAN and the identity mapping method proposed by He [18] into the ResNet architecture. The second category is data-driven, depends heavily on large manually annotated datasets, and usually requires substantial data and computational resources; for example, [19, 20] focus on capturing the morphological patterns inherent in the dataset.

Despite the continuous emergence of new models, only a few have been able to significantly improve the segmentation performance for epithelial tissue in pathological images. The reason is that staining quality varies across labs and operators, which causes high variability in the dataset and hence requires stronger feature extraction capabilities.

2.2 Attention Mechanism

One way to tackle the challenge of insufficient feature extraction capacity is to adopt an attention mechanism. First proposed in 2014 by Bahdanau [2, 33], the attention mechanism helps models assign different weights to different features within an image while extracting the most crucial information. In this way, networks can make more accurate judgments without imposing an extreme burden on computation and storage. In the field of computer vision, attention mechanisms can be divided into three main categories: channel attention, spatial attention, and hybrid-domain attention. Channel attention generates and scores a mask over the channels, as in SE-Net [21] and ECA-Net [22]. Spatial attention can be seen as an adaptive spatial region selection mechanism; RAM [23], STN [9], and GE-Net [7] are typical spatial attention modules with different emphases. Hybrid-domain attention evaluates and scores both channel and spatial attention, emphasizing meaningful features along both dimensions, as in DA-Net [5], the attention gate (AG) [24], and CBAM [25].
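To make the channel-attention idea concrete, the following minimal PyTorch sketch implements an SE-style block. It illustrates the general mechanism rather than reproducing code from any of the cited works; the module name and reduction ratio are our own choices.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze spatial information, score each channel, rescale."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze H x W down to 1 x 1
        self.score = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.score(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                               # reweight the feature channels
```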

In this paper, we integrate AG and the advantages of channel and spatial attention into a new attention module, in order to better adapt to the high variation between different pathological images and the high similarity of features within a pathological image.

2.3 Stain-Style Transfer

Another effective way to improve the accuracy of epithelium segmentation is to transfer HE-stained images into IHC-stained images, since this greatly reduces variability and heterogeneity. The epithelium mask can then be obtained by image processing methods. Thus our goal shifts from a segmentation task to a stain-style transfer task.

Most style transfer models are based on Generative Adversarial Networks (GANs) [26], which consist of a Generator (G) and a Discriminator (D). Pix2pix, proposed by Isola, pioneered the application of GANs to image-to-image translation; building on its idea, cycleGAN [27] was proposed. StainGAN [28] converted Hamamatsu staining to Aperio staining based on cycleGAN and achieved a high degree of visual similarity with the target domain. However, GAN-based generation methods have some drawbacks. Firstly, they are difficult to train; for example, cycleGAN trains two generators and two discriminators at the same time, which requires a large amount of memory and storage. Secondly, the loss curve oscillates substantially because the different generators and discriminators constantly pass information to each other, so the predicted output is often unstable.

Some methods also try to use a CNN alone to perform stain-style transfer [29]. For example, Gatys improved the performance of style transfer by adding an artistic style algorithm to a CNN [30]; building on Gatys, Chen added a Sobel operator and an improved loss function to the network to enhance the edge information of the synthetic images [31, 34]. Compared with GAN-like methods, these single-CNN models can improve transformation accuracy by enhancing the feature extraction ability of the network itself, while effectively avoiding loss oscillation and memory consumption problems. However, no single-CNN model has so far been designed specifically for the stain-style transfer of pathological images; hence this paper proposes an improved CNN model specifically for stain-style transformation.

3 Proposed Method

We propose a two-step epithelium segmentation framework, and the key step is stain-style transfer. The two steps will be described in the following two subsections, and the pipeline of the framework is shown in Fig. 2:

Fig. 2. The overview of our proposed two-step OPC epithelium tissue segmentation network, and the structure of our proposed CS-Net.

Fig. 3. The details of our proposed CS-Gate module.

3.1 Step 1: Stain-Style Transfer

The first step is stain-style transfer, i.e., transferring HE-stained images into IHC-stained images to highlight the epithelium region. We propose a stain-style transfer model named CS-Net (based on U-Net and the CS-Gate module, hence the name CS-Net), whose structure is shown in the blue box in Fig. 2. We make two improvements to U-Net.

The first improvement is to replace the encoder of U-Net with VGG16, which possesses stronger feature extraction capabilities. We observed that the pseudo-IHC-stained images generated by U-Net (used as a generative network) were blurred and faint, with a severe FP phenomenon, because the original encoder does not extract deep, essential features. Replacing the encoder with VGG16, which has a more reasonable convolutional layer design and depth, effectively solves this problem; a minimal sketch of such an encoder is given below.
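The paper does not give the exact split of the VGG16 layers, so the sketch below shows one common way (an assumption on our part) to reuse torchvision's VGG16 convolutional stack as a five-stage U-Net-style encoder whose stage outputs feed the skip connections.

```python
import torch.nn as nn
from torchvision.models import vgg16

def build_vgg16_encoder() -> nn.ModuleList:
    """Split the VGG16 convolutional stack into five stages usable as a U-Net-style encoder.

    Each stage ends just before the next max-pooling layer, so its output can feed both
    the next stage and a skip connection to the corresponding decoder level.
    """
    features = vgg16(weights=None).features  # ImageNet weights could be loaded if desired
    # MaxPool2d layers sit at indices 4, 9, 16, 23, 30 of vgg16().features.
    return nn.ModuleList([
        features[:4],     # conv1_x          ->  64 channels (skip 1)
        features[4:9],    # pool1 + conv2_x  -> 128 channels (skip 2)
        features[9:16],   # pool2 + conv3_x  -> 256 channels (skip 3)
        features[16:23],  # pool3 + conv4_x  -> 512 channels (skip 4)
        features[23:30],  # pool4 + conv5_x  -> 512 channels (bottleneck input)
    ])
```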

The second improvement is the insertion of an attention module. Inspired by the attention gate and the CBAM module, we combine these two modules into one, named CS-Gate. Our approach combines the advantages of AG and CBAM: the low-level and high-level features are encoded separately, then fused and processed by channel and spatial attention to produce a feature weight. In this way, CS-Gate allows the network to receive multi-dimensional information, as shown in Fig. 3 (a sketch of the module follows the loss definition below). Moreover, we use the smooth L1 loss function to better fit the stain-style transfer task for pathology images; it is defined as:

$$ smooth \;L1\;Loss = \frac{1}{n}\sum\nolimits_i {z_i } $$
(1)

where:

$$ z_i = \begin{cases} 0.5\,(x_i - y_i)^2, & |x_i - y_i| < 1 \\ |x_i - y_i| - 0.5, & \text{otherwise} \end{cases} $$
(2)

\(x_i\) and \(y_i\) represent the predicted and actual values, respectively.
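The precise layer configuration of CS-Gate is given in Fig. 3; the PyTorch sketch below is only our reading of the description above (AG-style gating of the skip features by the high-level features, followed by CBAM-style channel and spatial attention), with illustrative channel sizes. The smooth L1 loss of Eqs. (1)–(2) corresponds, with threshold 1, to PyTorch's built-in `nn.SmoothL1Loss`.

```python
import torch
import torch.nn as nn

class CSGate(nn.Module):
    """Sketch of CS-Gate: attention-gate fusion of low- and high-level features,
    refined by CBAM-style channel and spatial attention (layer sizes are illustrative)."""
    def __init__(self, skip_ch: int, gate_ch: int, inter_ch: int, reduction: int = 16):
        super().__init__()
        # Attention-gate branch: encode skip (low-level) and gating (high-level) features.
        self.theta = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)
        self.phi = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Sequential(nn.ReLU(inplace=True),
                                 nn.Conv2d(inter_ch, 1, kernel_size=1),
                                 nn.Sigmoid())
        # CBAM-style channel attention on the gated skip features.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(skip_ch, skip_ch // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(skip_ch // reduction, skip_ch, kernel_size=1),
            nn.Sigmoid())
        # CBAM-style spatial attention over avg- and max-pooled channel maps.
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid())

    def forward(self, skip: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # `gate` is assumed to be upsampled to the spatial size of `skip` beforehand.
        alpha = self.psi(self.theta(skip) + self.phi(gate))   # attention-gate coefficient
        x = skip * alpha
        x = x * self.channel(x)                               # channel reweighting
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        return x * self.spatial(pooled)                       # spatial reweighting

# Eqs. (1)-(2) with threshold 1 match PyTorch's built-in criterion:
criterion = nn.SmoothL1Loss()
```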

3.2 Step 2: Image Processing

The second step is the image processing described in Algorithm 1. The corresponding mask is obtained from the synthesized pseudo-IHC-stained images via this algorithm.

Algorithm 1.

It is worth pointing out that the values of 5000 and 800 in steps 3 and 4 are empirical thresholds obtained from experimental observations on a portion of the dataset. We observed that small holes with an area < 5000 inside the tissue and spots with an area > 800 outside the tissue are generally noise rather than segmentation errors, so the epithelial tissue mask obtained by Algorithm 1 is closer to the real label.
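Since Algorithm 1 appears only as a figure, the following scikit-image sketch is a plausible reconstruction rather than the authors' code: the stained signal is isolated by Otsu thresholding on the saturation channel, holes below 5000 px are filled, and isolated spots below 800 px are removed. The choice of channel, the threshold type, and the direction of the spot-area criterion are our assumptions; only the 5000 and 800 values come from the text.

```python
import numpy as np
from skimage.color import rgb2hsv
from skimage.filters import threshold_otsu
from skimage.morphology import remove_small_holes, remove_small_objects

def ihc_to_mask(pseudo_ihc: np.ndarray,
                hole_area: int = 5000,
                spot_area: int = 800) -> np.ndarray:
    """Convert a synthesized IHC-stained RGB image into a binary epithelium mask.

    Illustrative reconstruction of Algorithm 1, not the authors' exact procedure:
    the brown cytokeratin signal is separated by Otsu thresholding on the saturation
    channel, small holes inside the tissue are filled, and small isolated spots are
    removed (the 5000 / 800 pixel thresholds follow the paper; the colour channel
    and the direction of the spot criterion are assumptions).
    """
    hsv = rgb2hsv(pseudo_ihc)
    saturation = hsv[..., 1]                     # brown DAB-like regions are strongly saturated
    mask = saturation > threshold_otsu(saturation)
    mask = remove_small_holes(mask, area_threshold=hole_area)   # fill holes smaller than 5000 px
    mask = remove_small_objects(mask, min_size=spot_area)       # drop isolated spots smaller than 800 px
    return mask.astype(np.uint8) * 255
```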

4 Experiments

4.1 Dataset Description

The dataset used in this paper consists of tissue microarray (TMA) pathology images of OPC from Guangdong Provincial People's Hospital, with 208 samples of size 2048 × 2048, each comprising an HE-stained image, a paired IHC-stained image, and an epithelial tissue mask (as shown in Fig. 1). All samples are divided into training, validation, and test sets in a 6:2:2 ratio. We use a 256 × 256 sliding window, moving left to right and then top to bottom with 50% overlap, to extract image patches. All predicted patches are stitched back together in the order of extraction, and the maximum value in the overlapping parts is taken as the final pixel value of the region.
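A minimal NumPy sketch of the patching scheme described above (a 256 × 256 window with 50% overlap, i.e. a stride of 128, with overlapping predictions merged by taking the maximum); the function names are illustrative.

```python
import numpy as np

PATCH, STRIDE = 256, 128  # 256 x 256 window with 50% overlap

def extract_patches(image: np.ndarray):
    """Slide a 256 x 256 window left-to-right, top-to-bottom with 50% overlap."""
    coords, patches = [], []
    h, w = image.shape[:2]
    for top in range(0, h - PATCH + 1, STRIDE):
        for left in range(0, w - PATCH + 1, STRIDE):
            coords.append((top, left))
            patches.append(image[top:top + PATCH, left:left + PATCH])
    return coords, patches

def stitch_patches(coords, patches, out_shape):
    """Re-assemble predicted patches; overlapping regions keep the maximum pixel value."""
    out = np.zeros(out_shape, dtype=patches[0].dtype)
    for (top, left), patch in zip(coords, patches):
        region = out[top:top + PATCH, left:left + PATCH]
        out[top:top + PATCH, left:left + PATCH] = np.maximum(region, patch)
    return out
```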

In addition, we use the BCSS (Breast Cancer Semantic Segmentation) dataset as external verification to demonstrate the generalization of the proposed method; it contains 151 HE-stained whole-slide images (WSIs) with 151 corresponding tissue classification labels (Fig. 4).

Fig. 4. The method of dividing a 2048 × 2048 image into several 256 × 256 patches.

4.2 Evaluation Metrics

In this paper, Structural Similarity (SSIM) is used to evaluate the similarity between the generated IHC-stained images and Ground Truth (GT):

$$ SSIM = \frac{\left( 2\mu_x \mu_y + c_1 \right)\left( 2\sigma_{xy} + c_2 \right)}{\left( \mu_x^2 + \mu_y^2 + c_1 \right)\left( \sigma_x^2 + \sigma_y^2 + c_2 \right)} $$
(3)

where \(x\) denotes the generated IHC-stained image and \(y\) the GT; \(\mu_x\) and \(\mu_y\) are the mean values of \(x\) and \(y\); \(\sigma_x\) and \(\sigma_y\) are their standard deviations; \(\sigma_{xy}\) is the covariance of \(x\) and \(y\); and \(c_1\) and \(c_2\) are constants.
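In practice Eq. (3) need not be implemented by hand; for example, scikit-image provides an implementation (whether the authors used it, and the exact call parameters for RGB images, are assumptions on our part):

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_rgb(pred: np.ndarray, target: np.ndarray) -> float:
    """SSIM between a generated pseudo-IHC image and its ground truth (uint8 RGB arrays)."""
    return structural_similarity(pred, target, channel_axis=-1, data_range=255)
```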

In addition, five metrics (Accuracy, Precision, Specificity, Recall, and F1-score) are used to evaluate the results of the final epithelium binary masks obtained via the proposed framework compared with the common segmentation networks:

$$ {\text{Accuracy}} = \frac{TP + TN}{{TP + TN + FP + FN}} $$
(4)
$$ {\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}} $$
(5)
$$ {\text{Specificity}} = \frac{{{\text{TN}}}}{{{\text{FP}} + {\text{TN}}}} $$
(6)
$$ {\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} $$
(7)
$$ {\text{F}}_1 - {\text{score}} = \frac{{2 \cdot {\text{Precision}} \cdot {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}} $$
(8)

The meanings of TP, TN, FP, and FN are shown in the following Table 1:

Table 1. Confusion Matrix.
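For completeness, a small NumPy helper shows how Eqs. (4)–(8) can be computed pixel-wise from a predicted and a ground-truth binary mask; the function name and the epsilon guard are our own.

```python
import numpy as np

def mask_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> dict:
    """Pixel-wise metrics of Eqs. (4)-(8) for binary epithelium masks (1 = epithelium)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)       # epithelium pixels correctly predicted
    tn = np.sum(~pred & ~gt)     # background pixels correctly predicted
    fp = np.sum(pred & ~gt)      # background predicted as epithelium
    fn = np.sum(~pred & gt)      # epithelium predicted as background
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn + eps),
        "precision":   precision,
        "specificity": tn / (fp + tn + eps),
        "recall":      recall,
        "f1":          2 * precision * recall / (precision + recall + eps),
    }
```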

4.3 Implementation Details

The experiments were run on a Dell Precision 3640 with an Intel Core i9-10900K CPU @ 3.70 GHz × 20 and an NVIDIA GeForce RTX 3090. The hyper-parameters were kept the same for all experiments: Loss = L1Loss, Batch_size = 16, epoch = 20, Learning_rate = 0.0002, and the learning rate is halved every two epochs. To reduce the influence of experimental errors, the result of each network is reported as the mean ± standard deviation over three training runs.
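A minimal PyTorch sketch of the stated training configuration; the optimizer is not specified in the paper, so Adam is assumed here purely for illustration, and the stand-in model is a placeholder for CS-Net.

```python
import torch
import torch.nn as nn

# Stand-in module; in practice this would be the CS-Net model.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Hyper-parameters as reported; the optimizer choice (Adam) is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)  # halve LR every 2 epochs
criterion = nn.SmoothL1Loss()

for epoch in range(20):
    # ... iterate over the training loader with batch_size = 16, compute criterion, back-propagate ...
    scheduler.step()
```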

Firstly, we take U-Net as the baseline and combine it with the attention gate, CBAM, and our CS-Gate module, respectively. Secondly, we replace the encoder of U-Net with VGG16 (yielding VGG16UNet) and then plug the above three attention modules into it. Finally, we compare our method with the two most commonly used generative networks (cycleGAN and pix2pix). The results are shown in Table 2:

Table 2. Comparison of other networks and our CS-Net.
Fig. 5. Pseudo-IHC-stained images generated by different models.

Fig. 6. The epithelium binary masks obtained from the pseudo-IHC-stained images via Algorithm 1.

It can be seen that for U-Net, whether it is equipped with AG, CBAM, or CS-Gate, the results do not improve and even decrease slightly. However, after replacing the encoder with VGG16 to form VGG16UNet, the results improve greatly (from 84.86% to 88.33%). When the three attention modules are subsequently plugged into it, accuracy and SSIM improve in different respects, with our proposed CS-Gate improving the most (from 88.33% to 92.48%). From these results, it is reasonable to deduce that the original encoder of U-Net cannot extract sufficiently effective features from HE-stained images. Once the encoder is replaced with VGG16, which has stronger feature extraction capabilities, accuracy improves significantly, and the three attention modules finally take effect when added to VGG16UNet.

4.4 Qualitative Analysis Results

Figure 5 shows the pseudo-IHC-stained images generated by the different generative networks. In the first and fourth rows, the checkerboard artifact is greatly reduced by CS-Net; in the second row, CS-Net generates the most similar IHC-stained image, restoring color and texture features as much as possible; in the third row, CS-Net reduces the FP phenomenon the most. Overall, CS-Net synthesizes the images most similar to the GT.

Figure 6 shows the masks obtained from the pseudo-IHC-stained images via Algorithm 1. It can be seen that CS-Net generates the most accurate masks. U-Net has the second-best performance after CS-Net but still shows many false positives in the fourth row. Pix2pix and cycleGAN perform the worst, showing not only strong checkerboard artifacts but also high FP and FN rates in all samples.

Figure 7 illustrates that CS-Net has higher stability and lower loss than GAN-based generative networks (such as cycleGAN and pix2pix). Table 2 also shows that CS-Net synthesizes more similar images (about 10% higher in SSIM).

Fig. 7. The comparison of loss curves of the three methods.

4.5 Generalization Performance

Figure 8 shows the epithelium binary masks of BCSS patches predicted by the different methods, to validate their generalization ability. U-Net predicts accurately in some cases but does not perform well overall. Pix2pix and cycleGAN both exhibit poor generalization ability. CS-Net shows the best generalization capability and robustness among these methods.

Fig. 8. The epithelium binary masks of BCSS images predicted by different methods.

5 Conclusion

This paper proposes a stain-style transfer network named CS-Net, which has strong feature extraction capability and can synthesize more stable, higher-quality images with better color and texture details. CS-Net outperforms U-Net-based networks on the segmentation task and some GAN-based networks on the style-transfer task, with an outstanding accuracy of 92.48%. CS-Net also shows higher generalization ability than the U-Net-based and GAN-based networks.

CS-Net is the key step of our two-step framework for epithelium tissue segmentation, which first generates a pseudo-IHC-stained image and then performs image processing to obtain the mask. Compared with one-step methods that use only HE-stained images for segmentation, our method shows clear superiority.

However, this work also has some shortcomings, and more research could be done in the future. Firstly, we found that the more epochs are trained, the better the results; therefore, better results could theoretically be obtained if more training time were allowed. Secondly, a larger batch size would yield better results but would also consume more memory; if computing power were increased, for example by using multiple GPUs or providing more memory, a larger batch size could be used to further improve the results.