1 Introduction

Histopathology is considered the gold standard for diagnosing and treating many cancers [19]. Tissue slices are usually scanned into Whole Slide Images (WSIs), which serve as important references for pathologists. Unlike natural images, WSIs typically contain billions of pixels and have a pyramid structure, as shown in Fig. 1. This gigapixel resolution, together with the cost of pixel-wise annotation, poses unique challenges to constructing effective and accurate models for WSI analysis. To overcome these challenges, Multiple Instance Learning (MIL) has become a popular paradigm for WSI analysis. Typically, MIL-based WSI analysis methods have three steps: (1) crop the huge WSI into numerous image patches; (2) extract instance features from the cropped patches; and (3) aggregate the instance features to obtain slide-level prediction results. Many advanced MIL models have emerged in the past few years. For instance, ABMIL [9] and DeepAttnMIL [18] incorporated attention mechanisms into the aggregation step and achieved promising results. Recently, the Graph-Transformer architecture [17] was proposed to simultaneously learn short-range local features through a GNN and long-range global features through a Transformer. This architecture has also been introduced into WSI analysis [15, 20] to mine both global and local correlations between different image patches. However, current Graph-Transformer-based WSI analysis models only consider representation learning at one specific magnification, thus ignoring the rich multi-resolution information in the WSI pyramids.

Different resolution levels in the WSI pyramids contain different and complementary information [3]. The images at a high-resolution level contain cellular-level information, such as the nucleus and chromatin morphology features [10]. At a low-resolution level, tissue-related information like the extent of tumor-immune localization can be found [1], while the whole WSI describes the entire tissue microenvironment, such as intra-tumoral heterogeneity and tumor invasion [3]. Therefore, analyzing only a single resolution would lead to an incomplete picture of WSIs. Some very recent works have proposed to characterize and analyze WSIs in a pyramidal structure. H2-MIL [7] formulated a WSI as a hierarchical heterogeneous graph, and HIPT [3] proposed an inheritable ViT framework to model WSIs at different resolutions. However, these methods characterize only local or only global correlations within the WSI pyramids and use only unidirectional interaction between different resolutions, which limits their capability to model the rich multi-resolution information of the WSI pyramids.

In this paper, we present a novel Hierarchical Interaction Graph-Transformer framework (i.e., HIGT) to simultaneously capture both local and global information from WSI pyramids with a novel Bidirectional Interaction module. Specifically, we abstract the multi-resolution WSI pyramid as a heterogeneous hierarchical graph and devise a Hierarchical Interaction Graph-Transformer architecture to learn both short-range and long-range correlations among different image patches within different resolutions. Considering that the information from different resolutions is complementary and mutually beneficial, we design a Bidirectional Interaction block in our Hierarchical Interaction ViT module to establish communication between different resolution levels. Moreover, a Fusion block is proposed to aggregate the features learned from the different levels for slide-level prediction. To reduce the considerable computation and memory cost, we adopt an efficient pooling operation after the hierarchical GNN part to reduce the number of tokens and introduce the Separable Self-Attention mechanism in the Hierarchical Interaction ViT modules to reduce the computational burden. Extensive experiments with promising results on two public WSI datasets from TCGA projects, i.e., kidney carcinoma (KICA) and esophageal carcinoma (ESCA), validate the effectiveness and efficiency of our framework on both tumor subtyping and staging tasks. The code is available at https://github.com/HKU-MedAI/HIGT.

Fig. 1. Overview of the proposed HIGT framework. A WSI pyramid is first constructed as a hierarchical graph. The proposed Hierarchical Interaction GNN and Hierarchical Interaction ViT blocks capture the local and global features, and the Bidirectional Interaction module in the latter allows nodes from different levels to interact. Finally, the Fusion block aggregates the coarse-grained and fine-grained features to generate the slide-level prediction.

2 Methodology

Figure 1 depicts the pipeline of the HIGT framework, which is designed to better exploit the multi-scale information in hierarchical WSI pyramids. First, we abstract each WSI as a hierarchical graph, where the feature embeddings extracted from multi-resolution patches serve as nodes and the edges denote the spatial and scaling relationships of patches within and across different resolution levels. Then, we feed the constructed graph into several hierarchical graph convolution blocks to learn the short-range relationships among graph nodes, followed by pooling operations to aggregate local context and reduce the number of nodes. We further devise a Separable Self-Attention-based Hierarchical Interaction Transformer architecture equipped with a novel Bidirectional Interaction block to learn the long-range relationships among graph nodes. Finally, we design a Fusion block to aggregate the features learned from the different levels of the WSI pyramids for the final slide-level prediction.

2.1 Graph Construction

As shown in Fig. 1, a WSI is cropped into numerous non-overlapping \(512\times 512\) image patches under different magnifications (i.e., \(\times 5\), \(\times 10\)) by using a sliding window strategy, where the OTSU algorithm [4] is used to filter out the background patches. Afterwards, we employ a pre-trained KimiaNet [16] to extract the feature embedding of each image patch. The feature embeddings of the slide-level \(\boldsymbol{T}\) (Thumbnail), region-level \(\boldsymbol{R}\) (\(\times 5\)), and the patch-level \(\boldsymbol{P}\) (\(\times 10\)) can be represented as,

$$\begin{aligned} \boldsymbol{T}&=\{\boldsymbol{t}\}, \nonumber \\ \boldsymbol{R}&=\{\boldsymbol{r}_1, \boldsymbol{r}_2, \cdots , \boldsymbol{r}_N\}, \nonumber \\ \boldsymbol{P}&= \{\boldsymbol{P}_{1}, \boldsymbol{P}_{2}, \cdots , \boldsymbol{P}_{N}\}, \quad \boldsymbol{P}_{i} = \{\boldsymbol{p}_{i,1}, \boldsymbol{p}_{i,2}, \cdots , \boldsymbol{p}_{i,M}\}, \end{aligned}$$
(1)

where \(\boldsymbol{t}, \boldsymbol{r}_i, \boldsymbol{p}_{i, j} \in \mathbb {R}^{1 \times C}\) correspond to the feature embeddings of each patch at the thumbnail, region, and patch levels, respectively. N is the total number of region nodes, M is the number of patch nodes belonging to a given region node, and C denotes the dimension of the feature embedding (1,024 in our experiments). Based on the extracted feature embeddings, we construct a hierarchical graph to characterize the WSI, following the previous H\(^2\)-MIL work [7]. Specifically, the cropped patches serve as the nodes of the graph, and we use the extracted feature embeddings as the node embeddings. There are two kinds of edges in the graph: spatial edges, which denote the 8-adjacent spatial relationships among patches at the same level, and scaling edges, which denote the relationship between patches at the same location across different levels.
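To make the two edge types concrete, the following minimal sketch builds them for a toy pyramid. It is an illustration under stated assumptions, not the released implementation: the helper names `spatial_edges` and `scaling_edges` are hypothetical, patches are addressed by integer grid coordinates, and a scale factor of 2 is assumed because a \(512\times 512\) patch at \(\times 5\) covers a \(2\times 2\) block of \(512\times 512\) patches at \(\times 10\). Linking the single thumbnail node to every region node is likewise an assumption.

```python
def spatial_edges(coords):
    """Undirected 8-adjacency edges among the kept (tissue) patches of one level.

    coords: list of (row, col) grid positions; the list index is the node id.
    """
    index = {rc: i for i, rc in enumerate(coords)}
    edges = set()
    for (r, c), i in index.items():
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                j = index.get((r + dr, c + dc))
                if j is not None and j != i:
                    edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def scaling_edges(region_coords, patch_coords, factor=2):
    """Link each x10 patch to the x5 region that covers the same location."""
    region_index = {rc: i for i, rc in enumerate(region_coords)}
    edges = []
    for j, (r, c) in enumerate(patch_coords):
        parent = region_index.get((r // factor, c // factor))
        if parent is not None:  # the parent may have been filtered as background
            edges.append((parent, j))
    return edges


# Toy pyramid: a 2x2 grid of regions (x5) and the matching 4x4 grid of patches (x10).
region_coords = [(r, c) for r in range(2) for c in range(2)]
patch_coords = [(r, c) for r in range(4) for c in range(4)]

region_spatial = spatial_edges(region_coords)
patch_spatial = spatial_edges(patch_coords)
cross = scaling_edges(region_coords, patch_coords)
# Assumption: the thumbnail node (id 0 on its own level) connects to all regions.
thumbnail_edges = [(0, i) for i in range(len(region_coords))]
```

In this toy case every region node ends up with M = 4 child patches; after background filtering, M would vary from region to region.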

2.2 Hierarchical Graph Neural Network

To learn the short-range relationship among different patches within the WSI pyramid, we propose a new hierarchical graph message propagation operation, called RAConv+. Specifically, for any source node j in the hierarchical graph, we define the set of all its neighboring nodes at resolution k as \(\mathcal {N}_k\), where \(k\in \mathcal {K}\) and \(\mathcal {K}\) denotes the set of all resolutions. \(\boldsymbol{h}_k\) is the mean embedding of node j's neighboring nodes at resolution k, and \(\boldsymbol{h}_{j\prime } \in \mathcal {N}_k\) is the embedding of a neighboring node \(j\prime \) of node j at resolution k. The attention scores of node j at the resolution level and at the node level are calculated as:

$$\begin{aligned}&\alpha _k=\frac{\exp \left( \boldsymbol{a}^{\top } \cdot \text {LeakyReLU}\left( \left[ \boldsymbol{U} \boldsymbol{h}_j \Vert \boldsymbol{U} \boldsymbol{h}_k\right] \right) \right) }{\sum _{k^{\prime } \in \mathcal {K}} \exp \left( \boldsymbol{a}^{\top } \cdot \text {LeakyReLU}\left( \left[ \boldsymbol{U} \boldsymbol{h}_j \Vert \boldsymbol{U} \boldsymbol{h}_{k^{\prime }}\right] \right) \right) }, \nonumber \\&\alpha _{{j\prime }}=\frac{\exp \left( \boldsymbol{b}^{\top } \cdot \text {LeakyReLU}\left( \left[ \boldsymbol{V} \boldsymbol{h}_j \Vert \boldsymbol{V} \boldsymbol{h}_{{j\prime }}\right] \right) \right) }{\sum _{ \boldsymbol{h}_{{j\prime \prime }} \in \mathcal {N}_k} \exp \left( \boldsymbol{b}^{\top } \cdot \text {LeakyReLU}\left( \left[ \boldsymbol{V} \boldsymbol{h}_j \Vert \boldsymbol{V} \boldsymbol{h}_{{j\prime \prime }}\right] \right) \right) }, \nonumber \\&\alpha _{j,j\prime }=\alpha _k+ \alpha _{j\prime }, \end{aligned}$$
(2)

where \(\alpha _{j,j\prime }\) is the attention score of node j to node \(j\prime \) and \(\boldsymbol{h}_j\) is the embedding of the source node j. \(\boldsymbol{U}\), \(\boldsymbol{V}\), \(\boldsymbol{a}\), and \(\boldsymbol{b}\) are four learnable layers. The main difference from H2-MIL [7] is that we place the non-linear LeakyReLU between \(\boldsymbol{a}\) and \(\boldsymbol{U}\), and between \(\boldsymbol{b}\) and \(\boldsymbol{V}\), to generate a more distinct attention score matrix, which increases the feature differences between different types of nodes [2]. The layer-wise graph message propagation can therefore be represented as:

$$\begin{aligned} H^{(l+1)}=\sigma \left( \mathcal {A} \cdot H^{(l)} \cdot W^{(l)}\right) , \end{aligned}$$
(3)

where \(\mathcal {A}\) represents the attention score matrix, whose entry in the j-th row and j\(\prime \)-th column is given by Eq. (2). At the end of the hierarchical GNN part, we use IHPool [6] to progressively aggregate the hierarchical graph.
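As a rough illustration of Eqs. (2) and (3) for a single source node, the NumPy sketch below computes the resolution-level and node-level scores and then aggregates the neighbors. Several details are assumptions rather than facts from the text: the function names are hypothetical, the LeakyReLU slope is set to 0.2, \(\sigma \) is taken to be ReLU, the learnable layers are modeled as plain matrices and vectors without bias, and only one row of \(\mathcal {A}\) is formed.

```python
import numpy as np


def leaky_relu(x, slope=0.2):  # slope value is an assumption
    return np.where(x > 0, x, slope * x)


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def raconv_plus_scores(h_j, neighbors, U, V, a, b):
    """Attention scores of Eq. (2) for one source node.

    h_j: (C,) source embedding; neighbors: dict resolution -> (n_k, C) array.
    U, V: (D, C) projections; a, b: (2D,) scoring vectors.
    """
    levels = list(neighbors)
    # Resolution-level logits use the mean neighbor embedding h_k of each level.
    res_logits = np.array([
        a @ leaky_relu(np.concatenate([U @ h_j, U @ neighbors[k].mean(axis=0)]))
        for k in levels
    ])
    alpha_k, alpha_node, alpha = {}, {}, {}
    for k, weight in zip(levels, softmax(res_logits)):
        node_logits = np.array([
            b @ leaky_relu(np.concatenate([V @ h_j, V @ h]))
            for h in neighbors[k]
        ])
        alpha_k[k] = float(weight)
        alpha_node[k] = softmax(node_logits)       # normalized within N_k
        alpha[k] = alpha_k[k] + alpha_node[k]      # alpha_{j,j'} = alpha_k + alpha_{j'}
    return alpha_k, alpha_node, alpha


def propagate(neighbors, alpha, W):
    """One row of Eq. (3): sigma(sum_j' alpha_{j,j'} h_j' W), with sigma = ReLU (assumed)."""
    agg = sum(alpha[k] @ neighbors[k] for k in neighbors)
    return np.maximum(agg @ W, 0.0)


rng = np.random.default_rng(0)
C, D = 8, 4
h_j = rng.normal(size=C)
neighbors = {"x5": rng.normal(size=(3, C)), "x10": rng.normal(size=(5, C))}
U, V = rng.normal(size=(D, C)), rng.normal(size=(D, C))
a, b = rng.normal(size=2 * D), rng.normal(size=2 * D)

alpha_k, alpha_node, alpha = raconv_plus_scores(h_j, neighbors, U, V, a, b)
h_new = propagate(neighbors, alpha, rng.normal(size=(C, C)))
```

Because the two softmax terms are summed rather than multiplied, the combined scores of a node do not sum to one; the sketch keeps that behavior as written in Eq. (2).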

2.3 Hierarchical Interaction ViT

We further propose a Hierarchical Interaction ViT (HIViT) to learn long-range correlation within the WSI pyramids, which includes three key components: Patch-level (PL) blocks, Bidirectional Interaction (BI) blocks, and Region-level (RL) blocks.

Patch-Level Block. Given the patch-level feature set \(\boldsymbol{P}=\bigcup _{i=1}^N \boldsymbol{P}_i\), the PL block learns long-range relationships within the patch level:

$$\begin{aligned} \hat{\boldsymbol{P}}^{l+1}=PL(\boldsymbol{P}^l) \end{aligned}$$
(4)

where \(l = 1, 2, ..., L\) is the index of the HIViT block. \(PL(\cdot )\) consists of Separable Self-Attention (SSA) [13], a 1\(\times \)1 convolution, and Layer Normalization in sequence. We introduce SSA into the PL block to reduce the computational complexity of the attention calculation from quadratic to linear in the number of tokens while maintaining performance [13].
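The following NumPy sketch illustrates why SSA scales linearly with the number of tokens: instead of an \(n\times n\) attention map, it forms one scalar score per token and a single context vector. It follows our reading of the formulation in [13] and is an assumption-laden sketch, not the authors' code: the names `separable_self_attention`, `layer_norm`, and `pl_block` are hypothetical, biases, multi-head structure, and any residual connection are omitted, and the 1\(\times \)1 convolution is written as a per-token linear map.

```python
import numpy as np


def separable_self_attention(X, w_i, W_k, W_v, W_o):
    """X: (n, d) tokens. Cost is O(n * d^2): no n x n attention matrix is built."""
    scores = X @ w_i                          # (n,) one scalar per token
    scores = np.exp(scores - scores.max())
    scores = scores / scores.sum()            # context scores (softmax over tokens)
    context = scores @ (X @ W_k)              # (d,) single global context vector
    values = np.maximum(X @ W_v, 0.0)         # (n, d) ReLU-activated values
    return (values * context) @ W_o           # broadcast the context to every token


def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)      # affine parameters omitted


def pl_block(X, p):
    """SSA -> 1x1 convolution (per-token linear map) -> LayerNorm, in sequence."""
    Y = separable_self_attention(X, p["w_i"], p["W_k"], p["W_v"], p["W_o"])
    Y = Y @ p["W_conv"]
    return layer_norm(Y)


rng = np.random.default_rng(0)
n, d = 16, 8
params = {
    "w_i": rng.normal(size=d),
    "W_k": rng.normal(size=(d, d)),
    "W_v": rng.normal(size=(d, d)),
    "W_o": rng.normal(size=(d, d)),
    "W_conv": rng.normal(size=(d, d)),
}
P = rng.normal(size=(n, d))
P_hat = pl_block(P, params)
```

Doubling `n` doubles the work in every line of `separable_self_attention`, whereas canonical self-attention would quadruple the cost of its score matrix.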

Bidirectional Interaction Block. We propose a Bidirectional Interaction (BI) block to establish communication between different levels within the WSI pyramids. The BI block performs bidirectional interaction; the interaction from region nodes to patch nodes is:

$$\begin{aligned}&\boldsymbol{r_i^{l'}} \in \boldsymbol{R^{l'}}, \quad \boldsymbol{R^{l'}} = SE(\boldsymbol{R^l})\cdot \boldsymbol{R^l}, \nonumber \\&\boldsymbol{P}_i^{l+1} = \{\boldsymbol{p}_{i,1}^{l+1},\boldsymbol{p}_{i,2}^{l+1}, \cdots , \boldsymbol{p}_{i,k}^{l+1}\},\quad \boldsymbol{p}_{i,k}^{l+1} = \hat{\boldsymbol{p}}_{i,k}^{l+1}+\boldsymbol{r}_i^{l'}, \end{aligned}$$
(5)

where \(SE(\cdot )\) denotes the Squeeze-and-Excitation layer [8], \(\boldsymbol{r_i^{l'}}\) is the i-th region node in \(\boldsymbol{R^{l'}}\), \(\hat{\boldsymbol{p}}_{i,k}^{l+1}\) is the k-th patch node linked to the i-th region node as output by the PL block, and \(\boldsymbol{p}_{i,k}^{l+1}\) is the same node after the interaction. The interaction in the other direction, from patch nodes to region nodes, is:

$$\begin{aligned}&\bar{\boldsymbol{P}}^{l+1} = \{\bar{\boldsymbol{P}}_1^{l+1},\bar{\boldsymbol{P}}_2^{l+1},\cdots , \bar{\boldsymbol{P}}_N^{l+1}\}, \quad \bar{\boldsymbol{P}}_i^{l+1} = MEAN(\hat{\boldsymbol{P}}_i^{l+1}), \nonumber \\&\hat{\boldsymbol{R}}^{l+1} = SE(\bar{\boldsymbol{P}}^{l+1} )\cdot \bar{\boldsymbol{P}}^{l+1}+\boldsymbol{R}^{l}, \end{aligned}$$
(6)

where \(MEAN(\cdot )\) takes the mean of the patch node set \(\hat{\boldsymbol{P}}_i^{l+1}\) associated with the i-th region node, \(\bar{\boldsymbol{P}}_i^{l+1} \in \mathbb {R}^{1\times C}\) with C the feature dimension of the nodes, and \(\hat{\boldsymbol{R}}^{l+1}\) is the set of region nodes after the interaction.
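The two directions of Eqs. (5) and (6) can be sketched in NumPy as follows. This is an illustrative reading under assumptions: `se_gate` and `bidirectional_interaction` are hypothetical names, the Squeeze-and-Excitation layer is simplified to a channel gate obtained by averaging over nodes and passing the result through two linear maps (ReLU, then sigmoid), the two directions use separate gate weights, and `parent[k]` stores the region index that patch k is linked to.

```python
import numpy as np


def se_gate(X, W1, W2):
    """Simplified SE: squeeze over nodes, excite over channels. X: (n, C) -> (C,)."""
    squeezed = X.mean(axis=0)
    hidden = np.maximum(squeezed @ W1, 0.0)
    return 1.0 / (1.0 + np.exp(-(hidden @ W2)))


def bidirectional_interaction(P_hat, R, parent, se_r, se_p):
    """P_hat: (n, C) PL-block output; R: (N, C) region nodes; parent: (n,) ints."""
    # Eq. (5): region -> patch. Gate the regions, then add each to its patches.
    R_gated = se_gate(R, *se_r) * R
    P_next = P_hat + R_gated[parent]

    # Eq. (6): patch -> region. Average the patches of each region, gate, add R.
    sums = np.zeros_like(R)
    np.add.at(sums, parent, P_hat)
    counts = np.bincount(parent, minlength=R.shape[0])[:, None]
    P_bar = sums / np.maximum(counts, 1)       # guard against empty regions
    R_hat = se_gate(P_bar, *se_p) * P_bar + R
    return P_next, R_hat


rng = np.random.default_rng(0)
N, M, C = 3, 4, 8                              # 3 regions with 4 patches each
R = rng.normal(size=(N, C))
P_hat = rng.normal(size=(N * M, C))
parent = np.repeat(np.arange(N), M)
se_r = (rng.normal(size=(C, C // 2)), rng.normal(size=(C // 2, C)))
se_p = (rng.normal(size=(C, C // 2)), rng.normal(size=(C // 2, C)))

P_next, R_hat = bidirectional_interaction(P_hat, R, parent, se_r, se_p)
```

Note that the patch-to-region direction uses the PL-block output `P_hat` rather than the already updated `P_next`, matching the hatted symbols in Eq. (6).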

Region-Level Block. The final part of this module is to learn the long-range correlations of the interacted region-level nodes:

$$\begin{aligned} {\boldsymbol{R}}^{l+1}=RL(\hat{\boldsymbol{R}}^{l+1}) \end{aligned}$$
(7)

where \(l = 1, 2, ..., L\) is the index of the HIViT module, \(\boldsymbol{R}=\{\boldsymbol{r}_1, \boldsymbol{r}_2, \cdots , \boldsymbol{r}_N\}\), and \(RL(\cdot )\) has a similar structure to \(PL(\cdot )\).

2.4 Slide-Level Prediction

In the final stage of our framework, we design a Fusion block to combine the coarse-grained and fine-grained features learned from the WSI pyramids. Specifically, we use an element-wise summation operation to fuse the coarse-grained thumbnail feature and patch-level features from the Hierarchical Interaction GNN part, and then further fuse the fine-grained patch-level features from the HIViT part with a concatenation operation. Finally, a \(1\times 1\) convolution and mean operation followed by a linear projection are employed to produce the slide-level prediction.
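One possible reading of the Fusion block is sketched below in NumPy. The tensor shapes are assumptions, since the text does not fix them: the thumbnail feature is taken as a single \(1\times C\) vector that is broadcast over the patch-level nodes, the GNN and HIViT patch-level outputs are assumed to have the same number of nodes, and the \(1\times 1\) convolution is written as a per-node linear map from 2C to C channels. The function name `fusion_block` is hypothetical.

```python
import numpy as np


def fusion_block(t, P_gnn, P_vit, W_conv, W_cls, b_cls):
    """t: (1, C); P_gnn, P_vit: (n, C); W_conv: (2C, C); W_cls: (C, n_classes)."""
    coarse = P_gnn + t                                  # element-wise sum (broadcast)
    fused = np.concatenate([coarse, P_vit], axis=1)     # (n, 2C) concatenation
    tokens = fused @ W_conv                             # 1x1 conv as per-node linear map
    slide = tokens.mean(axis=0)                         # (C,) mean over nodes
    return slide @ W_cls + b_cls                        # slide-level logits


rng = np.random.default_rng(0)
n, C, n_classes = 10, 8, 2
logits = fusion_block(
    t=rng.normal(size=(1, C)),
    P_gnn=rng.normal(size=(n, C)),
    P_vit=rng.normal(size=(n, C)),
    W_conv=rng.normal(size=(2 * C, C)),
    W_cls=rng.normal(size=(C, n_classes)),
    b_cls=np.zeros(n_classes),
)
```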

3 Experiments

Datasets and Evaluation Metrics. We assess the efficacy of the proposed HIGT framework by testing it on two publicly available datasets (KICA and ESCA) from The Cancer Genome Atlas (TCGA) repository. The datasets are described below in more detail:

  • KICA dataset. The KICA dataset consists of 371 cases of kidney carcinoma, of which 279 are classified as early-stage and 92 as late-stage. For the tumor typing task, 259 cases are diagnosed as kidney renal papillary cell carcinoma, while 112 cases are diagnosed as kidney chromophobe.

  • ESCA dataset. The ESCA dataset comprises 161 cases of esophageal carcinoma, with 96 cases classified as early-stage and 65 as late-stage. For the tumor typing task, there are 67 squamous cell carcinoma cases and 94 adenocarcinoma cases.

Experimental Setup. The proposed framework was implemented with PyTorch [14] and PyTorch Geometric [5]. All experiments were conducted on a workstation with eight NVIDIA GeForce RTX 3090 (24 GB) GPUs. Each node feature extracted by KimiaNet has a shape of \(1\times 1024\). All methods are trained with a batch size of 8 for 50 epochs. The learning rate was set to 0.0005 with the Adam optimizer. Accuracy (ACC) and area under the curve (AUC) are used as the evaluation metrics. All approaches were evaluated with five-fold cross-validation (5-fold CV) from five different initializations.
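The evaluation protocol above can be sketched in plain Python as follows. The hyperparameter values are those stated in the text; everything else is an assumption made for illustration: the helper names are hypothetical, the folds are drawn by a simple shuffled split (the text does not say whether the split is stratified), and AUC is computed with the pairwise rank definition for a binary task.

```python
import random

# Values taken from the experimental setup; the dictionary keys are illustrative.
TRAIN_CONFIG = {"batch_size": 8, "epochs": 50, "lr": 5e-4, "optimizer": "Adam"}


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split (not stratified)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a positive case is scored above a negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# 371 KICA cases, five folds; repeating with seeds 0..4 would give five initializations.
splits = list(k_fold_indices(371, k=5, seed=0))
```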

Table 1. Comparison with other methods on ESCA. Top results are shown in bold.
Table 2. Comparison with other methods on KICA. Top results are shown in bold.

Comparison with State-of-the-Art Methods. We first compared our proposed HIGT framework with two groups of state-of-the-art WSI analysis methods: (1) non-hierarchical methods, including ABMIL [9], CLAM-SB [12], DeepAttnMIL [18], DS-MIL [11], and LA-MIL [15], and (2) hierarchical methods, including H2-MIL [7] and HIPT [3]. LA-MIL [15] uses a single-scale Graph-Transformer architecture, while H2-MIL [7] and HIPT [3] use a hierarchical Graph Neural Network and a hierarchical Transformer architecture, respectively. The results for the ESCA and KICA datasets are summarized in Table 1 and Table 2, respectively. Overall, our model achieves satisfactory results in both AUC and ACC for WSI classification, especially on the more complex task (i.e., staging), compared with the SOTA approaches. Even compared with the non-hierarchical Graph-Transformer baseline LA-MIL and the hierarchical Transformer model HIPT, our model achieves improvements of at least around 3% in AUC and 2% in ACC on the staging task of the KICA dataset. We therefore believe that our model benefits substantially from its modules and mechanisms.

Table 3. Ablation analysis on KICA dataset.
Fig. 2. Computational analysis of our framework and some selected SOTA methods. From left to right: scatter plots of Typing AUC vs. GPU Memory Allocation, Staging AUC vs. GPU Memory Allocation, Typing AUC vs. Model Size, and Staging AUC vs. Model Size.

Ablation Analysis. We further conduct an ablation study to demonstrate the effectiveness of the proposed components. The results are shown in Table 3. In the first row, we replace RAConv+ with the original version of this operation. In the second row, we replace the Separable Self-Attention with a canonical Transformer block. In the third row, we change the bidirectional interaction mechanism to a single direction, from region level to patch level. In the last row, we remove the Fusion block from our model. The results show that each of these modules improves the prediction performance of the model to some extent.

Computation Cost Analysis. We analyze the computation cost during the experiments to compare the efficiency of our method with that of existing state-of-the-art approaches. In Fig. 2, we plot the model size (MB) and the GPU memory allocation during training (GB) against performance on KICA's typing and staging tasks. The results demonstrate that our model maintains promising prediction results while effectively reducing the computational cost and model size.

4 Conclusion

In this paper, we propose HIGT, a framework that simultaneously and effectively captures local and global information from the hierarchical WSI. First, the hierarchical data structure constructed from the multi-resolution WSI offers multi-scale information to the subsequent model. Moreover, the redesigned H2-MIL and HIViT capture the short-range and long-range correlations among the different magnifications of the WSI, respectively. The bidirectional interaction mechanism and the Fusion block facilitate communication between different levels in the Transformer part. We use IHPool and apply Separable Self-Attention to deal with the inherently high computational cost of the Graph-Transformer model. Extensive experiments on two public WSI datasets demonstrate the effectiveness and efficiency of our framework, yielding promising results. In the future, we will evaluate our framework on other complex tasks such as survival prediction and investigate other techniques to improve its efficiency.