
1 Introduction

Histopathological diagnosis consists of examining tissue samples to characterize their phenotype, morphology, and the topological distribution of their constituents. With advancements in slide-scanning technologies, tissue specimens can now be digitized into Whole-Slide Images (WSIs) with high resolution, enabling the pathological assessment to be conducted on a computer rather than under a microscope. A WSI contains rich tissue information and can be up to 100 000\(\times \)100 000 pixels in size at 40\(\times \) magnification (0.25\(\mu \)m/pixel). Due to the image size, complexity, and multi-scale nature of biological systems, a pathologist generally examines a WSI in a hierarchical manner, i.e., detecting informative regions at a low magnification, and evaluating selected areas at a high magnification, as shown in Fig. 1(a). However, such manual examination of a gigapixel-sized WSI can be cumbersome, time-consuming, and prone to inter- and intra-observer variability [14, 16].

To alleviate the aforementioned challenges, Deep Learning (DL)-based diagnosis tools are being developed in digital pathology. However, these tools face additional challenges pertaining to the size of WSIs and the difficulty of acquiring fine-grained annotations. To this end, DL methods based on Multiple Instance Learning (MIL) have been proposed. Here, a WSI is decomposed into a bag of patches, which are individually encoded by a Convolutional Neural Network (CNN) backbone. A pooling operation then combines the patch embeddings into a slide-level representation that is finally mapped to the slide label. Although MIL methods have achieved remarkable performance on several pathology tasks, e.g., tumor classification [10, 32, 39, 42], tumor segmentation [22, 30], and survival prediction [44], they suffer from the following drawbacks.

First, the performance of MIL methods relies on a carefully tuned context-resolution trade-off [6, 36, 41], i.e., an operating resolution that includes adequate context in a patch. As the dimensions of diagnostically relevant tissue vary significantly in histopathology, patches of different sizes across magnifications convey different contextual information about the tissue microenvironment. Thus, identifying an optimal resolution and patch size involves several tailored and tedious steps. Typical MIL methods use patches at a single magnification (Fig. 1(b)) and disregard the spatial distribution of patches, which gives rise to this trade-off. Although [33, 40] address this via visual self-attention, they are constrained by the expensive computation of attention scores over the large number of patches in a WSI. Differently, [26] addresses the issue via random patch sampling and sparse convolutions, consequently preventing deterministic inference. Other methods [17, 19, 41] extract concentric patches across multiple magnifications (Fig. 1(c)) to acquire richer context per patch. However, they are computationally more expensive, as they need to encode all patches at high magnification along with the corresponding patches at lower magnifications.

Second, most MIL methods process all tissue patches at high magnification, thus processing a large number of uninformative patches, which increases computational cost, inference time, and memory requirements. For instance, inference on a WSI of \(50\,000 \times 50\,000\) pixels using CLAM [32], an MIL method, requires \(\approx \)150 Tera Floating-Point Operations (FLOPs), which is 37 500\(\times \) the cost of processing an ImageNet [12] sample with ResNet34 [18]. Further, the high memory footprint of MIL methods inhibits their scalability to large histopathology images, e.g., prostatectomy slides, which can be 300 000\(\times \)400 000 pixels at 40\(\times \) magnification. Such computational requirements can in turn hinder the clinical deployment of these methods, and their adoption becomes prohibitive when computational resources are scarce due to limited access to GPUs or cloud services. In view of the above challenges, a multi-scale context-aware MIL method with high computational efficiency is desired.

Fig. 1. Comparison of different methods for the diagnosis of WSIs.

In this paper, we propose ZoomMIL, a novel method inspired by the hierarchical diagnostic process of pathologists. We first select Regions-of-Interest (RoIs) at low magnification, and zoom in on them at high magnification for finer analysis, as in Fig. 1(d). The RoI selection is performed through a gated-attention and a differentiable top-K (Diff-TopK) module, which learns where to zoom in an end-to-end manner while moderating computational requirements at high magnifications. The process can be repeated across an arbitrary number of magnifications, e.g., 5\(\times \) \(\rightarrow \) 10\(\times \) \(\rightarrow \) 20\(\times \), as per the task at hand. Finally, we aggregate the information acquired across multiple scales to obtain a context-aware WSI representation for downstream pathology tasks, as shown in Fig. 2. In summary, our contributions are:

1. A novel multi-scale context-aware MIL method that learns to perform multi-level zooming in an end-to-end manner for WSI classification.

2. A computationally more efficient method compared to the state of the art in MIL, e.g., 40\(\times \) faster inference on a WSI of size 26 009\(\times \)18 234 pixels at 10\(\times \) magnification, while achieving better (2/3 datasets) or comparable (1/3 datasets) WSI classification performance.

3. Comprehensive benchmarking of the method with regard to WSI classification performance and computational requirements (on GPU and CPU) on multiple datasets across multiple organs and pathology tasks, i.e., tumor subtyping, grading, and metastasis detection.

2 Related Work

2.1 Multiple Instance Learning in Histopathology

MIL in histopathology was introduced in [20] to classify breast and colon RoIs. The experiments established the superiority of attention-based pooling over max and mean pooling. Concurrently, [10] scaled MIL to WSIs for grading prostate biopsies, proposing Recurrent Neural Network (RNN)-based pooling for end-to-end training. Later, several works [31, 32, 44] consolidated attention-based MIL across several organs and pathology tasks. Recently, transformer-based MIL [33, 40] has been proposed to consider inter-patch dependencies, with the downside of computing a quadratic number of interactions, which introduces memory constraints. Further, all the above MIL methods are limited to operating on all patches in a WSI at a single magnification. In view of the benefits of multi-scale information in histopathology [4, 15, 19, 28, 36, 43], a few recent methods [17, 27] have extended MIL to combine information across multiple magnifications. However, similar to single-scale methods, these multi-scale versions also require processing all patches in a WSI, which is computationally even more expensive. In contrast, our proposed ZoomMIL learns to identify informative regions at low magnification and subsequently zooms in on these regions at high magnification for efficient and comprehensive analysis. Differently, several other approaches aim to learn inter-instance relations in histopathology via Graph Neural Networks (GNNs) [1,2,3, 29, 36, 38, 45] or CNNs [26, 39, 42].

2.2 Instance Selection Strategies in Histopathology

Most MIL methods encode all patches in a WSI irrespective of their functional types. This makes MIL computationally expensive for large WSIs. To reduce the computational memory requirements, [26] randomly sampled a subset of instances, with the consequence of potentially missing vital information, especially when the informative set is small, e.g., in metastasis detection. Differently, reinforcement learning-based methods [13, 37] have also been developed to this end. [37] proposed to sequentially identify some of the diagnostically relevant RoIs in a WSI by following a parameterized policy. However, the method leverages a very coarse context for the RoI identification and is limited to utilizing only single-scale information for the diagnosis. Additionally, the reinforcement learning method [13] and the recurrent visual attention-based model [7] aim to select patches in a manner that mimics pathological diagnosis. However, these methods require pixel-level annotations to learn discriminative regions, which are expensive to acquire on large WSIs. In contrast to the above methods, ZoomMIL requires only WSI-level supervision. Our method can flexibly attend to several magnifications while classifying WSIs both efficiently and accurately.

The attention-score-based iterative sampling strategy proposed in [23, 25] closely relates to our work. There, the selected patch embeddings are simply concatenated for the final classification, analogous to average pooling. Instead, ZoomMIL incorporates a dual gated-attention module between two consecutive magnifications to simultaneously learn to select the relevant instances to zoom in on and to learn an improved WSI-level representation at the lower magnification.

The patch selection module employed in our work is inspired by the perturbed-optimizer-based [8] differentiable Top-K algorithm proposed in [11]. ZoomMIL advances upon [11] by extending it to several magnifications, i.e., multi-level zooming, and by scaling the application to gigapixel-sized WSIs.

3 MIL with Differentiable Zooming

In this section, we present ZoomMIL, which identifies informative patches at low magnification and zooms in on them for fine-grained analysis. In Sect. 3.1, we introduce the gated-attention mechanism that determines the informative patches at a given magnification. In Sect. 3.2, we describe how to make the attention-based patch selection differentiable across multiple magnifications. Finally, we present in Sect. 3.3 our overall architecture, in particular our proposed Dual Gated Attention and multi-scale information aggregation.

3.1 Attention-Based MIL

In MIL, an input X is considered as a bag of instances \(X = \{\textbf{x}_1, ..., \textbf{x}_N\}\). Given a classification task with C labels, there exists an unknown label \(\textbf{y}_i \in C\) for each instance and a known label \(\textbf{y} \in C\) for the bag. In our context, the input is a WSI and the instances denote the extracted patches. We follow the embedding-based MIL approaches [20, 32, 40], where a patch-level feature extractor h maps each patch \(\textbf{x}_i\) to a feature vector \(\textbf{h}_i = h(\textbf{x}_i) \in \mathbb {R}^D\). Afterwards, a pooling operator \(g(\cdot )\) aggregates the feature vectors \(\textbf{h}_{i=1:N}\) to a single WSI-level feature representation. Finally, a classifier \(f(\cdot )\) uses the WSI representation to predict the WSI-level label \(\hat{\textbf{y}} \in C\). The end-to-end process can be summarized as:

$$\begin{aligned} \hat{\textbf{y}} = f \bigg ( g \Big ( \{ h(\textbf{x}_1), \dots , h(\textbf{x}_N) \} \Big ) \bigg ) \;. \end{aligned}$$
(1)
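
For concreteness, Eq. (1) corresponds to the following minimal PyTorch sketch; here \(h\), \(g\), and \(f\) are placeholders for the patch encoder, pooling operator, and classifier, not the paper's actual implementation.

```python
import torch

def mil_forward(patches, h, g, f):
    # Eq. (1): encode each patch with h, pool the bag with g, classify with f.
    H = torch.stack([h(x) for x in patches])  # bag of patch features, (N, D)
    return f(g(H))                            # WSI-level prediction y_hat
```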

To aggregate the patch features, we use attention-pooling, specifically, Gated Attention (GA) from [20]. Let \(\textbf{H} = [\textbf{h}_1, \dots , \textbf{h}_N]^\top \in \mathbb {R}^{N \times D}\) be the patch-level feature matrix; the WSI-level representation \(\textbf{g}\) is then computed as:

$$\begin{aligned} \textbf{g} = \sum _{i=1}^{N} a_i \textbf{h}_i, \qquad a_i = \frac{\exp \{\textbf{w}^\top (\tanh (\textbf{V}\textbf{h}_i) \odot \eta (\textbf{U} \textbf{h}_i))\}}{\sum _{j=1}^N \exp \{ \textbf{w}^\top (\tanh (\textbf{V}\textbf{h}_j) \odot \eta (\textbf{U} \textbf{h}_j)) \}} \;, \end{aligned}$$
(2)

where \(\textbf{w}\) \(\in \) \(\mathbb {R}^{L \times 1}\), \(\textbf{V}\) \(\in \) \(\mathbb {R}^{L \times D}\), \(\textbf{U}\) \(\in \) \(\mathbb {R}^{L \times D}\) are learnable parameters with hidden dimension L, \(\odot \) is element-wise multiplication, and \(\eta (\cdot )\) is the sigmoid function. While previous attention-based MIL methods [20, 32] were designed to operate at a single magnification, we propose an efficient and flexible framework that can be extended to arbitrarily many magnifications while being fully differentiable.
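
As a sketch of Eq. (2) in PyTorch: the hidden size \(L\), the biases in the linear layers, and all names below are our assumptions (the actual module additionally uses dropout, see Sect. 4.2).

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Gated-attention pooling (Eq. 2); a sketch, not the reference code."""
    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        self.V = nn.Linear(dim, hidden)  # tanh branch
        self.U = nn.Linear(dim, hidden)  # sigmoid gate
        self.w = nn.Linear(hidden, 1)    # scoring vector w

    def forward(self, H):                # H: (N, D) patch features
        s = self.w(torch.tanh(self.V(H)) * torch.sigmoid(self.U(H)))  # (N, 1)
        a = torch.softmax(s, dim=0)      # attention weights a_i over patches
        g = (a * H).sum(dim=0)           # WSI-level representation g, (D,)
        return g, a.squeeze(-1)
```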

Fig. 2. Overview of the proposed ZoomMIL. (I) and (II) present the distinct training and inference modes, generically exemplified for M magnifications.

3.2 Attending to Multiple Magnifications

We assume the WSI is accessible at magnifications indexed by \(m \in \{1, \dots , M\}\), where the highest magnification is at M and the magnification at \(m+1\) is twice that at m, consistent with the pyramidal format of WSIs. To efficiently extend MIL to multiple magnifications, we hierarchically identify informative patches from low-to-high magnifications and aggregate their features to get the WSI representation. To identify the patches at m, we first compute \(\textbf{a}_m \in \mathbb {R}^N\), which includes an attention score per patch. Then, the top K patches with the highest scores are selected for further processing at a higher magnification. The corresponding selected patch feature matrix is denoted by

$$\begin{aligned} \widetilde{\textbf{H}}_{m} = \textbf{T}^\top _{m} \textbf{H}_{m} \;, \end{aligned}$$
(3)

where \(\textbf{T}_{m} \in \{0,1\}^{N \times K}\) is an indicator matrix and \(\textbf{H}_{m} \in \mathbb {R}^{N \times D}\) is the patch feature matrix at m.

Instead of a handcrafted approach, we propose to drive the patch selection at m directly by the prediction output of \(f(\cdot )\). This could be achieved via a backpropagation path from the output of \(f(\cdot )\) to the attention module at m, without introducing any additional loss or associated hyperparameters. However, this naive formulation is non-differentiable as it involves a Top-K operation. To address this problem, we build on the perturbed maximum method [8] to make the Top-K selection differentiable, inspired by [11], and apply it to the attention weights \(\textbf{a}_{m}\) at magnification m. Specifically, \(\textbf{a}_{m}\) is first perturbed by adding Gaussian noise \(\textbf{Z} \in \mathbb {R}^N\) scaled by \(\sigma \). Then, a linear program is solved for each perturbed attention vector, and the results are averaged. The forward pass of the differentiable Top-K module can thus be written as:

$$\begin{aligned} \textbf{T} = \mathop {\mathbb {E}}_{\textbf{Z}\sim \mathcal {N}(\textbf{0}, \mathbb {1})} \Big [ \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\hat{\textbf{T}}} \langle \hat{\textbf{T}}, \big ( \textbf{a}_m + \sigma \textbf{Z} \big ) \textbf{1}^\top \rangle \Big ] \;, \end{aligned}$$
(4)

where \(\textbf{1}^\top = [1 \cdots 1] \in \mathbb {R}^{1 \times K}\) and \((\textbf{a}_m + \sigma \textbf{Z})\textbf{1}^\top \in \mathbb {R}^{N \times K}\) denotes the perturbed attention weights repeated K times, and \(\langle \cdot \rangle \) is a scalar product preceded by a vectorization of the matrices. The corresponding Jacobian is defined as:

$$\begin{aligned} J_{\textbf{a}_m}\textbf{T} = \mathop {\mathbb {E}}_{\textbf{Z}\sim \mathcal {N}(\textbf{0}, \mathbb {1})} \Big [ \mathop {{{\,\mathrm{arg\,max}\,}}}\limits _{\hat{\textbf{T}}} \langle \hat{\textbf{T}}, \big ( \textbf{a}_m + \sigma \textbf{Z} \big ) \textbf{1}^\top \rangle \textbf{Z}^\top / \sigma \Big ] \;. \end{aligned}$$
(5)

More details on the derivation are provided in the supplemental material. The differentiable Top-K operator enables learning the parameters of the attention module that weighs the patches at a given magnification. Unlike [11], where patch sizes are scaled proportionally to the magnifications, we maintain a constant patch size across magnifications. Consequently, the number of patches grows with magnification, quadrupling with each doubling. This also provides different fields-of-view of the tissue microenvironment and enables us to capture a variety of contexts, which is crucial for analyzing WSIs as they contain diagnostically relevant constituents of various sizes. To achieve the zooming objective, we expand the indicator matrix \(\textbf{T}_m\) to select from the patch features \(\textbf{H}_{m'}~\in \mathbb {R}^{N \cdot 4^{(m'-m)} \times D}\), where \(m' > m\). Specifically, we compute the Kronecker product between \(\textbf{T}_{m}\) and the identity matrix \(\mathbb {1}_{m'} = \textrm{diag}(1, \cdots , 1) \in \mathbb {R}^{4^{(m'-m)} \times 4^{(m'-m)}}\) to obtain the expanded indicator matrix \(\textbf{T}_{m'} \in \{0, 1\}^{N \cdot 4^{(m'-m)} \times K \cdot 4^{(m'-m)}}\). Analogously to Eq. (3), patch selection at \(m'\) using the attention weights from m can be performed using

$$\begin{aligned} \widetilde{\textbf{H}}_{m'} = \big ( \textbf{T}_{m} \otimes \mathbb {1}_{m'} \big )^\top \textbf{H}_{m'}, \end{aligned}$$
(6)

where \(\textbf{H}_{m'}\) is the feature matrix at \(m'\) and \(\widetilde{\textbf{H}}_{m'}\) is the selected feature matrix.
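
A PyTorch sketch of the perturbed Top-K operator and the indicator expansion of Eqs. (4)-(6) is given below; the noise scale \(\sigma \), the number of Monte-Carlo samples, and all names are our assumptions rather than the reference implementation's.

```python
import torch

class PerturbedTopK(torch.autograd.Function):
    """Differentiable Top-K via perturbed maxima (Eqs. 4-5); a sketch."""

    @staticmethod
    def forward(ctx, scores, k, sigma=0.05, n_samples=100):
        n = scores.shape[0]
        z = torch.randn(n_samples, n, device=scores.device)      # Z ~ N(0, 1)
        perturbed = scores.unsqueeze(0) + sigma * z              # (S, N)
        idx = torch.topk(perturbed, k, dim=-1).indices
        idx = idx.sort(dim=-1).values                            # stable column order
        ind = torch.nn.functional.one_hot(idx, n).float()        # (S, K, N)
        ind = ind.transpose(-1, -2)                              # (S, N, K)
        ctx.save_for_backward(z, ind)
        ctx.sigma = sigma
        return ind.mean(dim=0)                                   # T of Eq. (4), (N, K)

    @staticmethod
    def backward(ctx, grad_out):
        z, ind = ctx.saved_tensors
        # Eq. (5): the Frobenius product <dL/dT, T_s> weighs each noise sample Z_s.
        inner = (grad_out.unsqueeze(0) * ind).sum(dim=(-1, -2))  # (S,)
        grad_scores = (inner.unsqueeze(-1) * z).mean(dim=0) / ctx.sigma
        return grad_scores, None, None, None

def expand_indicator(T, levels=1):
    """Eq. (6): the Kronecker product with a 4**levels identity maps each
    selected patch to its children at the next magnification."""
    eye = torch.eye(4 ** levels, device=T.device)
    return torch.kron(T, eye)                                    # (N*4^l, K*4^l)
```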

3.3 Dual Gated Attention and Multi-scale Aggregation

Figure 2 shows ZoomMIL in its training (I) and inference (II) modes.

Training Mode: The feature matrix \(\textbf{H}_{1}\) at \(m=1\) passes through a Dual Gated Attention (DGA) block. DGA consists of two gated-attention modules, \(\text {GA}_{1}\) and \(\text {GA}'_{1}\). \(\text {GA}_{1}\) is trained to obtain an optimal attention-pooled WSI-level representation \(\textbf{g}_1\) at low magnification. \(\text {GA}'_{1}\) calculates attention weights \(\textbf{a}'_1\) that are used to identify important patches to zoom in on. Alternatively, a single attention module could be used for both tasks. However, this would prevent optimal zooming, as the selected low-magnification patches would aim to optimize the classification performance only with information from the low magnification. Employing separate attention modules decouples the optimization tasks and, in turn, enables obtaining complementary information from both magnifications. Subsequently, the differentiable Top-K selection module, \({\textbf {T}}_1\), is employed to learn to select the most informative patches. The selected higher-magnification patch feature matrix \(\widetilde{\textbf{H}}_{2}\) is then obtained via Eq. (6).

The process of selecting patch features at every subsequent higher magnification is repeated until the highest magnification M. The selected patch features \(\widetilde{\textbf{H}}_{M}\) at M go through a last gated-attention block \(\text {GA}_{M}\) to produce \(\textbf{g}_{M}\). Finally, the attention-pooled features from all magnifications, \(\textbf{g}_{1}, \textbf{g}_{2}, \dots , \textbf{g}_{M}\), are aggregated via sum-pooling to obtain a multi-scale, context-aware representation of the WSI. Inspired by residual learning [18], sum-pooling is used, as the features across different magnifications are closely related and the summation leverages their complementarity. The final classifier \(f(\cdot )\) maps the WSI representation to the prediction \(\hat{\textbf{y}} \in C\). The training phase can be regarded as extending Eq. (1) with sum-pooling over multiple magnifications:

$$\begin{aligned} \hat{\textbf{y}} = f \Big ( \textbf{g}_{1}(\textbf{H}_{1}) + \textbf{g}_{2}(\widetilde{\textbf{H}}_{2}) + \dots + \textbf{g}_{M}(\widetilde{\textbf{H}}_{M}) \Big ) \;. \end{aligned}$$
(7)
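
Putting the pieces together, the training pass culminating in Eq. (7) can be sketched as follows, reusing the GatedAttention, PerturbedTopK, and expand_indicator sketches above; it assumes child patches are stored contiguously under their parent at each level, and all function and argument names are ours.

```python
def zoommil_training_forward(H_list, ga_pool, ga_zoom, k, classifier):
    # H_list[m]: full (N_m, D) pre-extracted features at magnification m;
    # ga_pool[m] / ga_zoom[m] play the roles of GA_m / GA'_m in the DGA block.
    H, sel, g_sum = H_list[0], None, 0.0
    for m in range(len(H_list)):
        if sel is not None:
            H = sel.T @ H_list[m]          # restrict features to selected regions
        g, _ = ga_pool[m](H)               # GA_m: slide-level vector at level m
        g_sum = g_sum + g                  # sum-pooling across magnifications
        if m + 1 < len(H_list):
            _, a = ga_zoom[m](H)           # GA'_m: attention for where to zoom
            T = PerturbedTopK.apply(a, k)  # differentiable Top-K (Sect. 3.2)
            step = expand_indicator(T)     # selection of the child patches
            sel = step if sel is None else expand_indicator(sel) @ step
    return classifier(g_sum)               # Eq. (7)
```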

Inference Mode: The differentiable Top-K operator in our model learns to identify informative patches during training. However, this operator includes random perturbations to the attention weights, and thus makes the forward pass of the model non-deterministic. Therefore, we replace differentiable Top-K with conventional non-differentiable Top-K during inference, which is also faster as no perturbations have to be computed. As shown in Fig. 2, another crucial difference to the training mode is that the patch selection directly operates on the WSI patches, \(\textbf{P}_{m'} \in {\mathbb {R}}^{N \cdot 4^{{(m'-1)}} \times p_{h} \times p_{w} \times p_{c}}\), instead of the pre-extracted patch features \(\textbf{H}_{m'}\). This avoids the extraction of features for uninformative patches during inference, unlike other MIL methods. It significantly reduces the computational requirements and speeds up model inference.
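
At inference, the hard selection can be sketched as follows; the contiguous parent-child ordering of patches at the next magnification is our assumption, and only the returned patches are subsequently encoded.

```python
import torch

@torch.no_grad()
def select_children(a, k, levels=1):
    # Hard Top-K on the attention scores (no perturbations), then map each
    # selected parent i to its 4**levels children, assumed stored contiguously.
    idx = torch.topk(a, k).indices.sort().values       # selected parent indices
    c = 4 ** levels
    offsets = torch.arange(c, device=a.device)
    return (idx.unsqueeze(1) * c + offsets).flatten()  # child row indices

# e.g., feats_next = encoder(P_next[select_children(a, k)])
```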

4 Experiments

4.1 Datasets

We benchmark ZoomMIL on three public H&E-stained WSI datasets.

CRC [34] contains 1133 colorectal biopsy and polypectomy slides from non-neoplastic, low-grade, and high-grade lesions, accounting for 26.5%, 48.7%, 24.8% of the data. The slides were acquired at the IMP Diagnostics laboratory, Portugal, and were digitized by a Leica GT450 scanner at 40\(\times \). We split the data into 70%/10%/20% stratified sets for training, validation, and testing.

BRIGHT [9] consists of breast WSIs from non-cancerous, precancerous, and cancerous subtypes. The slides were acquired at the Fondazione G. Pascale, Italy, and scanned by an Aperio AT2 scanner at 40\(\times \). We used the BRIGHT challenge splits containing 423, 80, and 200 WSIs for training, validation, and testing.

CAMELYON16 [5] includes 270 WSIs, 160 normal and 110 with metastases, for training, and 129 slides for testing. The slides were scanned by 3DHISTECH and Hamamatsu scanners at 40\(\times \) at the Radboud University Medical Center and the University Medical Center Utrecht, Netherlands. We split the 270 slides into 90%/10% stratified sets for training and validation.

The average numbers of (pixels, patches) within the tissue area at 20\(\times \) magnification for the CRC, BRIGHT, and CAMELYON16 datasets are (227.28 Mpx, 3 468), (1.04 Gpx, 15 872), and (648.28 Mpx, 9 892), respectively.

4.2 Implementation Details

Preprocessing: For each WSI, we detect the tissue area using a Gaussian tissue detector [21] and divide the tissue into 256\(\times \)256 patches at all considered magnifications. We ensure that each high-magnification patch is associated with the corresponding lower-magnification patch. We encode the patches with ResNet-50 [18] pre-trained on ImageNet [12] and apply adaptive average pooling after the third residual block to obtain 1024-dimensional embeddings.
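
A sketch of this patch encoder is shown below, under the assumption that "third residual block" refers to the layer3 stage of torchvision's ResNet-50, whose output has 1024 channels; batch size and inputs are illustrative.

```python
import torch
import torchvision

# ResNet-50 trunk truncated after the third residual stage (layer3), followed
# by adaptive average pooling -> 1024-dim embeddings per 256x256 patch.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
encoder = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3,  # (B, 1024, 16, 16)
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),  # (B, 1024)
).eval()

with torch.no_grad():
    feats = encoder(torch.randn(8, 3, 256, 256))        # 8 patches -> (8, 1024)
```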

ZoomMIL: The gated-attention module comprises three 2-layer Multi-Layer Perceptrons (MLPs), where the first two are followed by Hyperbolic Tangent and Sigmoid activations, respectively. The classifier is a 2-layer MLP with ReLU activation. We use a dropout probability of 0.25 in all fully-connected layers.

Implementation: All methods are implemented in PyTorch [35] and run on a single NVIDIA A100 GPU. ZoomMIL uses \(K=\{16, 12, 300\}\) on CRC, BRIGHT, and CAMELYON16, respectively, and our more efficient variant ZoomMIL-Eff uses \(K=\{12, 8\}\) on CRC and BRIGHT, respectively. We use the Adam optimizer [24] with 0.0001 learning rate and plateau scheduler (patience=5 epochs, decay rate=0.8). The experiments are run for 100 epochs with a batch size of one. For CRC & CAMELYON16, the models with the best validation loss are saved for testing. On BRIGHT, we observed that the baselines perform poorly compared to ZoomMIL when using validation loss as the model selection criterion. We therefore employ best validation weighted-F1 for model selection on BRIGHT since it improves the baselines, giving them a better competitive chance against ZoomMIL.

4.3 Results and Discussion

Baselines: We compare ZoomMIL with state-of-the-art MIL methods. Specifically, we compare with ABMIL [20], which uses a gated-attention pooling, and its variant CLAM [32], which also includes an instance-level clustering loss. We further compare with two spatially-aware methods, namely, TransMIL [40], which models instance-level dependencies using transformer-based pooling, and SparseConvMIL [26], which selects random subsets of patches and employs sparse convolutions for pooling. In addition, we compare with multi-scale methods MSMIL [17] and DSMIL [27], which are computationally less efficient than ZoomMIL as they encode all patches in a WSI across all considered magnifications. For completeness, we also include vanilla MIL methods based on max-pooling (MaxMIL) [26] and mean-pooling (MeanMIL) [26], following SparseConvMIL’s strategy of random patch selection. Additional implementation details and hyper-parameters are provided in the supplemental material. For a fair comparison, preprocessing including the extraction of patch embeddings is done consistently in the same manner, as described in Sect. 4.2.

4.3.1 WSI Classification Performance:

We present the classification results in terms of weighted F1-score and accuracy in Tables 1, 2, and 3. Mean±standard deviation of the metrics is computed over three runs with different weight initializations. Corresponding magnifications of operation are shown alongside each method for each dataset. We include two versions of ZoomMIL using either 2 or 3 magnifications, denoted as ZoomMIL-Eff (efficient) and ZoomMIL.

Table 1. Performance and efficiency measurement on CRC [34]. The best and second-best classification results are in bold and underlined, respectively.
Table 2. Performance and efficiency measurement on BRIGHT [9]. The best and second-best classification results are in bold and underlined, respectively.
Table 3. Performance and efficiency measurement on CAMELYON16 [5]. The best and second-best classification results are in bold and underlined, respectively.

On CRC, ZoomMIL outperforms CLAM-SB and TransMIL by 1.1% and 2.2% in weighted F1-score, respectively, and ZoomMIL-Eff achieves comparable performance. Furthermore, ZoomMIL shows superior performance compared to the multi-scale methods MSMIL and DSMIL. For the individual classes, ZoomMIL achieves 94.3%, 93.6%, and 86.4% average F1-scores in the one-vs-rest setting.

WSIs in BRIGHT are 4.5\(\times \) larger than in CRC and thus provide a better evaluation ground for efficient scaling. ZoomMIL achieves the best performance, outperforming MSMIL by 6.6%, CLAM-SB and DSMIL by 5.2%, and TransMIL by 2.8% in weighted F1-score. Notably, ZoomMIL-Eff achieves the second-best results. For the individual classes, ZoomMIL reaches average F1-scores of 70.4%, 56.5%, and 77.8%. The performance is lowest for the challenging pre-cancerous class, which often resembles the other two classes.

For CAMELYON16, we set the lowest magnification to 10\(\times \) as the metastatic regions can be extremely small (see Fig. 4). Nevertheless, this constraint still adversely impacts the performance, resulting in 1.1% lower average accuracy than TransMIL. However, this difference translates to misclassifying only 1–2 test WSIs.

Overall, ZoomMIL performs better on CRC and BRIGHT, while being comparable to the state of the art on CAMELYON16. It also consistently outperforms ZoomMIL-Eff, highlighting the inherent performance-efficiency trade-off, i.e., performance reduction in exchange for gains in computational efficiency.

4.3.2 Efficiency Measurements:

We analyze the efficiency in terms of FLOPs and average processing time for inference (see Tables 1, 2, and 3). Note that the computational cost in the MIL modules is negligible compared to patch feature extraction, which is computationally the most expensive step. The FLOPs and processing time for different methods can therefore appear to be equal, as their difference only becomes visible several digits after the decimal point. On CRC, ZoomMIL uses \(\approx \)10\(\times \) fewer FLOPs and less time than CLAM-SB and TransMIL. Compared to MSMIL and DSMIL, this factor increases to >12\(\times \). On BRIGHT, our efficient variant reduces computational requirements by >50\(\times \) compared to MSMIL and DSMIL, and >40\(\times \) compared to CLAM-SB and TransMIL, while providing comparable performance. On CAMELYON16, ZoomMIL uses \(\approx 1/3\) of the FLOPs of MSMIL, DSMIL, CLAM-SB, and TransMIL. The relatively lower efficiency gain stems from the fact that metastatic regions occupy only a small fraction of a WSI and thus need to be analyzed at a finer magnification. Across all datasets, the methods adopting random patch selection (MaxMIL, MeanMIL, and SparseConvMIL) have computational requirements similar to ZoomMIL but perform significantly worse.

To further highlight our efficiency gain, we show in Fig. 3 the model throughput (images/hour) against the performance (accuracy) for all methods on BRIGHT. The marked efficiency frontier curves signify the best possible accuracies for different minimal throughput requirements. Noticeably, ZoomMIL-Eff running on a single-core CPU (\(\approx \)300 images/h) provides throughput similar to MSMIL, DSMIL, CLAM-SB, and TransMIL running on a cutting-edge NVIDIA A100 GPU (\(\approx \)500–600 images/h). ZoomMIL’s low computational requirements make it more practical and suitable for clinical deployment, where IT infrastructures are often under-developed and require large investments to establish and maintain a digital workflow.

Fig. 3. Throughput vs. classification accuracy for different MIL methods on BRIGHT: (left) on a single-core CPU, (right) on an NVIDIA A100 GPU. Efficiency frontier curves are drawn in red (CPU) and blue (GPU).

4.3.3 Interpretability:

We interpret ZoomMIL by qualitatively analyzing its patch-level attention maps. Figure 4(a,b) shows the maps for two cancerous WSIs in BRIGHT at 1.25\(\times \), and Fig. 4(c-f) shows the maps for four metastatic WSIs in CAMELYON16 at 10\(\times \). We further include the corresponding tumor regions annotated by an expert pathologist for comparison. Brighter regions in the maps mark higher attention scores, i.e., regions more influential for the model prediction.

For the BRIGHT WSIs, ZoomMIL correctly attends to the cancerous areas in (a,b), pays lower attention to the pre-cancerous area in (b), and least attention to the remaining non-cancerous areas, which include non-cancerous epithelium, stroma, and adipose tissue. For the CAMELYON16 WSIs, (c,d) are correctly classified, as ZoomMIL gives high attention to the metastatic regions of different sizes. However, the extremely small metastases in (e,f) receive low attention and are disregarded by the Top-K module, leading to misclassification of these WSIs. Notably, for cases with tiny metastases, relatively higher attention is imparted to the periphery of the tissue. This is consistent with the fact that metastases generally appear in the subcapsular zone of lymph nodes, as can be observed in (c-f). The presented visualizations are obtained at low magnifications, which signifies ZoomMIL’s ability to learn to zoom in. More interpretability maps for other classes and fine-grained attention maps from the attention modules at higher magnifications in ZoomMIL are provided in the supplemental material.

Fig. 4. Annotated tumor regions and attention maps from the lowest magnification of ZoomMIL are presented for (a,b) BRIGHT and (c-f) CAMELYON16 WSIs.

4.3.4 Ablation Study:

We ablate the individual modules of ZoomMIL-Eff, chosen for its simpler 2-magnification design. The results on BRIGHT are given in Table 4.

Table 4. Ablation study on BRIGHT, with the varied algorithmic component tabulated in the left-most column. All experiments use \(K=8\).

Differentiable Patch Selection: We compared our attention-based differentiable patch selection (Diff-TopK) against three alternatives: random selection at the lowest magnification (RandomK @ 1.25\(\times \)), random selection at the highest magnification (Random4K @ 2.5\(\times \)), and the non-differentiable Top-K selection (NonDiff-TopK) at the lowest magnification. The top rows in Table 4 show the superiority of Diff-TopK: owing to its differentiability, it learns to select patches through gradient-based optimization of the model’s prediction.

Dual Gated Attention: We examined DGA consisting of two separate gated attention modules \(\text {GA}_{1}\) and \(\text {GA}'_{1}\) at low magnification, as discussed in Sect. 3.3. The former computes a slide-level representation and the latter learns to select patches at higher magnification. We can conclude from Table 4 that two separate attentions lead to better patch selection and improved slide representation for overall improved classification.

Feature Aggregation: We aggregate slide-level representations across magnifications through sum-pooling, as shown in Eq. (7). We compared it against two alternatives: using only the highest-magnification features (Features@2.5\(\times \)) and fusing representations via concatenation (represented as @1.25\(\times \) || @2.5\(\times \)). Table 4 shows that concatenation improves performance, indicating the value of multi-scale information. However, our sum-pooling, inspired by residual learning [18], significantly outperforms concatenation as it leverages the complementarity of the two magnifications more effectively.

5 Conclusion

In this work, we introduced ZoomMIL, a novel framework for WSI classification. The method is more than an order of magnitude faster than previous state-of-the-art methods during inference while achieving comparable or better accuracy. Essential to our method is the concept of differentiable zooming, which allows the model to learn which patches are informative and thus worth zooming in on. We conducted extensive quantitative and qualitative evaluations on three different datasets and demonstrated the importance of each component in our model with a detailed ablation study. Finally, we showed that ZoomMIL is a modular architecture that can easily be deployed in different flavors, depending on the performance-efficiency requirements of a given application. In future work, it would be interesting to further study the attention maps of ZoomMIL and compare them with the visual attention of pathologists.