
1 Introduction

Automated delineation of diagnostically relevant regions in histology images is pivotal for developing automated computer-aided diagnosis systems in computational pathology. Accurate delineation helps direct pathologists' attention and thereby improves diagnosis [33]. This is particularly valuable when analyzing giga-pixel histology images. To this end, several supervised methods have been proposed to efficiently segment glands [8, 29], tumor regions [3, 7], and tissue types [5]. Though these methods achieve high-quality semantic segmentation, they demand tissue-, organ-, and task-specific densely pixel-annotated training datasets. Acquiring such annotations for each diagnostic scenario is laborious, time-consuming, and often not feasible. Thus, weakly supervised semantic segmentation (\(\mathrm {WSS}\)) methods [10, 43] have been proposed to learn from weak supervision, such as inexact coarse image labels, incomplete supervision with partial annotations, and inaccurate supervision where annotations may not always be ground truth.

\(\mathrm {WSS}\) methods using various learning approaches, such as graphical models, multi-instance learning, and self-supervised learning, are reviewed in [10], and \(\mathrm {WSS}\) methods using various types of weak annotations are presented in [2, 43]. Despite delivering excellent segmentation performance, mostly on natural images, \(\mathrm {WSS}\) methods encounter challenges in histology images [10], since histology images contain (i) finer-grained objects, i.e., large intra- and inter-class variations [34], and (ii) often ambiguous boundaries among tissue components [39]. Nevertheless, some \(\mathrm {WSS}\) methods have been proposed for histology. Among those, the methods in [13, 14, 16, 18, 35, 38] perform patch-wise image segmentation and cannot incorporate the global tissue-microenvironment context. While [9, 28] operate on larger image tiles, they remain constrained to fixed, limited-size images. Thus, a \(\mathrm {WSS}\) method that operates on arbitrary and large histology images by utilizing both local and global context is needed. Further, most methods focus on binary classification tasks. Though HistoSegNet [9] handles multiple classes, it requires training images with exact, fine-grained image-level annotations. Exact annotations demand that pathologists annotate images beyond standard clinical needs and norms. Thus, a \(\mathrm {WSS}\) method should ideally be able to learn from inexact, coarse, image-level annotations. Additionally, to generalize to other \(\mathrm {WSS}\) tasks in histology, methods should avoid complex, task-specific post-processing steps, as in HistoSegNet [9]. Notably, \(\mathrm {WSS}\) methods in the literature utilize only a single type of annotation, although complementary information from readily available multiplex annotations could boost \(\mathrm {WSS}\) performance.

To this end, we propose \(\textsc {SegGini}\), “SEGmentation using Graphs with Inexact aNd Incomplete labels”. \(\textsc {SegGini}\) represents a histology image as a superpixel-based tissue-graph and follows a classification approach to segment it. Our major contributions are: (i) \(\textsc {SegGini}\) is the first \(\mathrm {WSS}\) method scalable to arbitrary image sizes, unlike pixel-based \(\mathrm {WSS}\) or fully-connected graph-based \(\mathrm {WSS}\) [26, 41]; (ii) to the best of our knowledge, \(\textsc {SegGini}\) is the first \(\mathrm {WSS}\) method to simultaneously learn from weak multiplex supervision, i.e., inexact image-level labels and incomplete scribbles; (iii) \(\textsc {SegGini}\) incorporates both local and global inter-tissue-region relations to build contextualized segmentation, principally in agreement with the inter-pixel-relation-based state-of-the-art \(\mathrm {WSS}\) method [2].

We evaluate our method on two H&E stained prostate cancer datasets [27, 42] and segment Gleason patterns, i.e., Benign (B), Grade3 (GG3), Grade4 (GG4), and Grade5 (GG5), by using incomplete scribbles of Gleason patterns and inexact image-level Gleason grades. An image-level grade is defined as the combination of the most common (primary, P) and the second most common (secondary, S) cancer growth patterns in the image. Figure 1 exemplifies incomplete and inexact annotations, along with complete pixel-level and exact image-level annotation.
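To make the grading convention concrete, the following is a minimal sketch of how an image-level grade (P+S) could be derived from a complete pixel-level annotation mask; the integer label encoding (0=B, 1=GG3, 2=GG4, 3=GG5) is a hypothetical choice for illustration, not one prescribed here.

```python
import numpy as np

def gleason_grade(mask: np.ndarray):
    """Derive the image-level grade (primary + secondary pattern) from a
    pixel-level annotation mask with hypothetical labels 0=B, 1..3=GG3..GG5."""
    labels, counts = np.unique(mask, return_counts=True)
    # Rank cancer patterns (label > 0) by pixel count, most common first.
    cancer = sorted(((c, l) for l, c in zip(labels, counts) if l > 0),
                    reverse=True)
    if not cancer:                   # benign-only image
        return 0, 0
    primary = cancer[0][1]
    # With a single pattern present, the secondary defaults to the primary.
    secondary = cancer[1][1] if len(cancer) > 1 else primary
    return primary, secondary
```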

Fig. 1. Overview of various annotation types for a sample prostate cancer WSI: (a) complete pixel-level and exact image-level annotation, (b) incomplete scribbles of Gleason patterns, and (c) inexact image-level Gleason grade (P+S).

2 Methods

This section presents the proposed \(\textsc {SegGini}\) methodology (Fig. 2) for scalable \(\mathrm {WSS}\) of histology images. First, an input image is preprocessed and transformed into a tissue-graph representation, where the graph nodes denote tissue superpixels. Then, a Graph Neural Network (\(\mathrm {GNN}\)) learns contextualized features for the graph nodes. The resulting node features are processed by a Graph-head, a Node-head, or both, depending on the type of weak supervision. The outcomes of the heads are used to segment Gleason patterns. Additionally, a classification step identifies image-level Gleason grades from the segmentation map.

Preprocessing and Tissue Graph Construction. An input H&E stained image X is stain-normalized using the algorithm in [31] to reduce appearance variability due to tissue preparation. Then, the normalized image is transformed into a Tissue-Graph (\(\mathrm {TG}\)) (Fig. 2(a)), as proposed in [20]. Formally, we define a \(\mathrm {TG}\) as \(G := (V, E, H)\), where the nodes V encode meaningful tissue regions in the form of superpixels, and the edges E represent inter-tissue interactions. Each node \(v \in V\) is encoded by a feature vector \(h(v) \in \mathbb {R}^d\), and we denote the set of node features \(h(v), \, \forall v \in V\) as \(H \in \mathbb {R}^{|V| \times d}\). Motivated by [6], we use superpixels as visual primitives, since rectangular patches may span multiple distinct structures.

The \(\mathrm {TG}\) construction follows three steps: (i) superpixel construction to define V, (ii) superpixel feature extraction to define H, and (iii) graph topology construction to define E. For the superpixels, we first use the unsupervised SLIC algorithm [1], emphasizing spatial proximity. Over-segmented superpixels are produced at a lower magnification to capture homogeneity, offering a good compromise between granularity and noise smoothing. The superpixels are then hierarchically merged based on channel-wise color similarity at higher magnification, measured via channel-wise 8-bin color histograms, mean, standard deviation, median, energy, and skewness. The merged superpixels form the \(\mathrm {TG}\) nodes. Merging reduces the node complexity of the \(\mathrm {TG}\), thereby enabling scaling to large images and contextualization to distant nodes, as explained in the next section. To characterize the \(\mathrm {TG}\) nodes, we extract morphological and spatial features. Patches of 224\(\times \)224 pixels are extracted from the original image and encoded into 1280-dimensional features with MobileNetV2 [23] pre-trained on ImageNet [11]. For a node \(v \in V\), morphological features are computed as the mean of the representations of the patches that belong to v. Spatial features are computed by normalizing the superpixel centroids by the image size. Finally, we define the \(\mathrm {TG}\) topology by constructing a region adjacency graph (RAG) [22] from the spatial connectivity of the superpixels.
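A minimal sketch of these three steps, assuming scikit-image and torchvision are available. The SLIC parameters are illustrative, the hierarchical color-based merging is omitted for brevity, and only one patch per superpixel centroid is encoded (whereas the method averages over all patches belonging to a node).

```python
import numpy as np
import torch
from skimage.segmentation import slic
try:
    from skimage.graph import rag_mean_color          # skimage >= 0.20
except ImportError:
    from skimage.future.graph import rag_mean_color   # older skimage
from torchvision import models

def build_tissue_graph(image: np.ndarray):
    """Build (V, E, H) from an RGB uint8 image of shape (H, W, 3), >= 224 px per side."""
    # (i) Nodes V: SLIC over-segmentation (illustrative parameters;
    # the hierarchical color-based merging step is omitted here).
    sp = slic(image, n_segments=400, compactness=30, start_label=0)

    # (iii) Edges E: region adjacency graph over the superpixel map.
    edges = np.array(rag_mean_color(image, sp).edges)

    # (ii) Features H: 1280-d MobileNetV2 embedding of one 224x224 patch
    # around each superpixel centroid, plus the normalized centroid.
    cnn = models.mobilenet_v2(weights="IMAGENET1K_V1").features.eval()
    h_img, w_img = image.shape[:2]
    feats = []
    for v in np.unique(sp):
        ys, xs = np.nonzero(sp == v)
        cy, cx = int(ys.mean()), int(xs.mean())
        y0 = int(np.clip(cy - 112, 0, h_img - 224))
        x0 = int(np.clip(cx - 112, 0, w_img - 224))
        patch = image[y0:y0 + 224, x0:x0 + 224]
        x = torch.from_numpy(patch.copy()).permute(2, 0, 1).float()[None] / 255.0
        with torch.no_grad():  # (ImageNet mean/std normalization omitted)
            morph = cnn(x).mean(dim=(2, 3)).squeeze(0)     # morphology, 1280-d
        spatial = torch.tensor([cy / h_img, cx / w_img])   # position, 2-d
        feats.append(torch.cat([morph, spatial]))
    return sp, edges, torch.stack(feats)                   # V map, E, H
```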

Fig. 2. Overview of the proposed \(\textsc {SegGini}\) methodology. Following superpixel extraction: (a) tissue-graph construction and contextualization, (b) Graph-head: \(\mathrm {WSS}\) via graph classification, (c) Node-head: \(\mathrm {WSS}\) via node classification.

Contextualized Node Embeddings. Given a \(\mathrm {TG}\), we aim to learn discriminative node embeddings (Fig. 2(a)) that benefit from the nodes’ context, i.e., the tissue microenvironment and inter-tissue interactions. The contextualized node embeddings are subsequently used for semantic segmentation. To this end, we use a \(\mathrm {GNN}\), which operates on graph-structured data [12, 17, 37]. In particular, we use Graph Isomorphism Network (\(\mathrm {GIN}\)) [37] layers, a powerful and fast \(\mathrm {GNN}\) architecture that functions as follows. For each node \(v \in V\), \(\mathrm {GIN}\) uses a sum-operator to aggregate the features of the node’s neighbors \(\mathcal {N}(v)\). It then updates the node features h(v) by combining the aggregated features with the current node features via a multi-layer perceptron (\(\mathrm {MLP}\)). After T \(\mathrm {GIN}\) layers, i.e., after acquiring context up to T hops, the intermediate node features \(h^{(t)}(v), \; t=1,\dots , T\) are concatenated to define the contextualized node embeddings [36]. Formally, a \(\mathrm {GNN}\) \(\mathcal {F}_\theta \) with batch normalization (BN) is described for \(v, u \in V\) as,

$$\begin{aligned} h^{(t+1)}(v)&= \mathrm {MLP}\Big ( \mathrm {BN}\big ( h^{(t)}(v) + \sum _{u \in \mathcal {N}(v)} h^{(t)}(u) \big ) \Big ), \quad t = 0, \dots , T-1 \end{aligned}$$
(1)
$$\begin{aligned} h(v)&= \mathrm {Concat}\Big (\Big \{\, h^{(t)}(v) \;\Big |\; t = 1, \dots , T \,\Big \}\Big ) \end{aligned}$$
(2)
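The update in Eqs. (1)–(2) can be sketched in PyTorch as follows, assuming an edge list that contains both directions of every undirected edge and an input feature width equal to the hidden width (in practice an input projection would precede the first layer). The two-layer PReLU MLPs, T = 6, and 32-dimensional embeddings follow the settings reported in Sec. 3.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN update, Eq. (1): h <- MLP(BN(h + sum over neighbors of h))."""
    def __init__(self, dim: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.PReLU(),
                                 nn.Linear(dim, dim), nn.PReLU())

    def forward(self, h: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        src, dst = edges                       # (2, E), both edge directions
        agg = torch.zeros_like(h).index_add_(0, dst, h[src])  # sum-aggregate
        return self.mlp(self.bn(h + agg))

class ContextGNN(nn.Module):
    """F_theta: T stacked GIN layers; Eq. (2) concatenates the intermediate
    node features into the contextualized embedding."""
    def __init__(self, dim: int = 32, T: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(GINLayer(dim) for _ in range(T))

    def forward(self, h: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        states = []
        for layer in self.layers:              # T hops of context
            h = layer(h, edges)
            states.append(h)
        return torch.cat(states, dim=1)        # |V| x (T * dim)
```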

Weakly Supervised Semantic Segmentation. The contextualized node embeddings \(h(v), \, \forall v \in V\) of a graph G, corresponding to an image X, are processed by \(\textsc {SegGini}\) to assign a class label \(\in \{1,..,K\}\) to each node v, where K is the number of semantic classes. \(\textsc {SegGini}\) can incorporate multiplex annotations, i.e., an inexact image label \(Y_X\) and incomplete scribbles \(Y_S\). The weak supervisions for G are then the graph label \(Y_G\), i.e., the image label \(Y_X\), and the node labels \(y_v \in Y_V\), which are extracted from \(Y_S\) by assigning the most prevalent class within each node. This is a reasonable assumption, as the tissue regions are built to be semantically homogeneous. The Graph-head (Fig. 2(b)) and the Node-head (Fig. 2(c)) are executed to use \(Y_G\) and \(Y_V\), respectively. Notably, unlike [9], \(\textsc {SegGini}\) does not involve any post-processing, making it a generic method that can be applied to various organs, tissue types, segmentation tasks, etc.
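Extracting \(Y_V\) from the scribbles \(Y_S\) amounts to a majority vote inside each superpixel. A minimal sketch, assuming a (H, W) superpixel map and a scribble map in which unannotated pixels carry a hypothetical -1 label:

```python
import numpy as np

def node_labels_from_scribbles(sp: np.ndarray, scribble: np.ndarray,
                               unlabeled: int = -1) -> np.ndarray:
    """Assign each superpixel (node) the most prevalent scribbled class
    inside it; nodes without any scribble stay unlabeled."""
    labels = np.full(sp.max() + 1, unlabeled)
    for v in np.unique(sp):
        classes = scribble[(sp == v) & (scribble != unlabeled)]
        if classes.size:
            vals, counts = np.unique(classes, return_counts=True)
            labels[v] = vals[counts.argmax()]  # most prevalent class wins
    return labels
```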

The Graph-head consists of a graph classification module and a feature attribution module. First, a graph classifier \(\mathcal {F}_\phi \) predicts \(\hat{Y}_G\) for G. \(\mathcal {F}_\phi \) includes (i) a global average pooling readout operation that produces a fixed-size graph embedding \(h_G\) from the node embeddings \(h(v),\forall v \in V\), and (ii) an \(\mathrm {MLP}\) that maps \(h_G\) to \(\hat{Y}_G\). As G directly encodes X, the need for patch-based processing is nullified. \(\mathcal {F}_\theta \) and \(\mathcal {F}_\phi \) are trained on a graph-set \(\mathcal {G}\), extracted from the image-set \(\mathcal {X}\), by optimizing a multi-label weighted binary cross-entropy loss \(\mathcal {L}_G:= l(Y_G, \hat{Y}_G)\). The class-weights are defined as \(w_i = \log (N/n_i), i=1,...,K\), where \(N=|\mathcal {X}|\) and \(n_i\) is the number of examples of class i, so that higher weights are assigned to rarer classes to mitigate class imbalance during training. Second, in an off-line step, we employ a discriminative feature attribution technique to measure importance scores \(\forall v \in V\) towards the classification of each class. Specifically, we use \(\textsc {GraphGrad-CAM}\) [15, 21], a version of \(\textsc {Grad-CAM}\) [24] that can operate with \(\mathrm {GNN}\)s. The node labels are then determined by taking the argmax across the class-wise node attribution maps from \(\textsc {GraphGrad-CAM}\).
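A sketch of the graph classifier and its weighted loss, reusing the GNN sketch above; the readout and the class-weight formula follow the text, the example counts are illustrative, and the off-line GraphGrad-CAM attribution step is not shown.

```python
import torch
import torch.nn as nn

class GraphHead(nn.Module):
    """Graph classifier F_phi: global-average-pooling readout over the node
    embeddings, followed by a 2-layer MLP giving K multi-label logits."""
    def __init__(self, dim: int, K: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.PReLU(),
                                 nn.Linear(dim, K))

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: |V| x dim
        h_G = h.mean(dim=0, keepdim=True)                 # readout, 1 x dim
        return self.mlp(h_G)                              # logits, 1 x K

# Class weights w_i = log(N / n_i); N = |X| (here the 886 UZH spots) and
# n_i per-class example counts (illustrative numbers).
N, counts = 886.0, torch.tensor([400.0, 250.0, 150.0, 86.0])
w = torch.log(N / counts)
loss_G_fn = nn.BCEWithLogitsLoss(weight=w)  # multi-label weighted BCE, L_G
```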

The Node-head simplifies image segmentation into classifying the nodes \(v \in V\). It feeds \(h(v),\,\forall v \in V\) to an \(\mathrm {MLP}\) classifier \(\mathcal {F}_\psi \) to predict node labels \(\hat{y}_v,\,\forall v \in V\). \(\mathcal {F}_\theta \) and \(\mathcal {F}_\psi \) are trained using a multi-class weighted cross-entropy loss \(\mathcal {L}_V:= l(y_v, \hat{y}_v)\). The class-weights are defined as \(w_i = \log (N/n_i), i=1,...,K\), where N is the number of annotated nodes and \(n_i\) is the number of nodes of class i. The node-wise predicted classes yield the final segmentation map.
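A corresponding sketch of the Node-head; the dimensions follow Sec. 3 (T = 6 layers, 32-d embeddings), the node counts are illustrative, and the -1 ignore index matches the hypothetical unlabeled encoding used above.

```python
import torch
import torch.nn as nn

dim, K = 6 * 32, 4   # T*32-d contextualized embeddings; 4 Gleason classes

# Node classifier F_psi: a 2-layer MLP predicting one of K classes per node.
node_head = nn.Sequential(nn.Linear(dim, dim), nn.PReLU(), nn.Linear(dim, K))

# Weighted cross-entropy L_V with w_i = log(N/n_i), where N is the number
# of annotated nodes; unannotated nodes (label -1) are ignored.
counts = torch.tensor([500.0, 300.0, 150.0, 50.0])  # illustrative node counts
w = torch.log(counts.sum() / counts)
loss_V_fn = nn.CrossEntropyLoss(weight=w, ignore_index=-1)
```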

Multiplexed Supervision: For multiplex annotations, both heads are executed to perform \(\mathrm {WSS}\). \(\mathcal {F}_\theta \), \(\mathcal {F}_\phi \), and \(\mathcal {F}_\psi \) are jointly trained to optimize a weighted loss \(\mathcal {L} = \lambda \mathcal {L}_G + (1 - \lambda ) \mathcal {L}_V\), so that the complementary information in the multiplex annotations improves the individual classification tasks and, in turn, the \(\mathrm {WSS}\). Subsequently, we employ the classification approach in [4] to determine the Gleason grades from the generated segmentation maps.
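A schematic joint training step under multiplex supervision, reusing the hypothetical modules and losses from the earlier sketches; lam corresponds to the \(\lambda \) searched over {.25, .5, .75} in Sec. 3.

```python
import torch

def joint_step(gnn, graph_head, node_head, loss_G_fn, loss_V_fn,
               node_feats, edges, Y_G, y_V, lam=0.5):
    """One optimization step of L = lam * L_G + (1 - lam) * L_V."""
    h = gnn(node_feats, edges)                        # F_theta: contextualize
    loss = (lam * loss_G_fn(graph_head(h), Y_G)       # Graph-head term, L_G
            + (1 - lam) * loss_V_fn(node_head(h), y_V))  # Node-head term, L_V
    loss.backward()             # joint gradients for F_theta, F_phi, F_psi
    return loss
```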

3 Experiments

We evaluate our method on two prostate cancer datasets for Gleason pattern segmentation and Gleason grade classification.

UZH dataset [42] comprises five TMAs with 886 spots, digitized at 40\(\times \) resolution (0.23 \(\mu \)m/pixel). The spots (3100 \(\times \) 3100 pixels) contain complete pixel-level annotations and inexact image-level grades. We follow a 4-fold cross-validation at the TMA level with testing on TMA-80, as in [4]. A second pathologist’s annotations on the test TMA are used as a pathologist baseline.

SICAPv2 dataset [27] contains 18 783 patches of size 512 \(\times \) 512 pixels with complete pixel annotations, along with WSI-level grades, from 155 WSIs at 10\(\times \) resolution. We reconstruct the original WSIs and annotation masks from the patches; the reconstructed WSIs contain up to \(11000^2\) pixels. We follow a 4-fold cross-validation at the patient level, as in [27]. An independent pathologist’s annotations are included as a pathologist baseline.

We evaluate the methods in four annotation settings: complete (\(\mathcal {C}\)) pixel annotations, incomplete (\(\mathcal {IC}\)) pixel annotations, inexact image labels (\(\mathcal {IE}\)), and \(\mathcal {IE+IC}\). \(\mathcal {IC}\) annotations with various pixel percentages are created by randomly selecting regions from \(\mathcal {C}\) (see the supplementary material for details). We report per-class and average Dice scores as segmentation metrics, and the weighted F1-score as the classification metric. For all experiments, we present means and standard deviations on the test set over the 4-fold cross-validation.
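For reference, a minimal sketch of the per-class and average Dice computation over predicted and ground-truth label maps; scoring a class absent from both maps as 1.0 is a common convention, not one prescribed here.

```python
import numpy as np

def dice_scores(pred: np.ndarray, gt: np.ndarray, K: int = 4) -> np.ndarray:
    """Per-class Dice between predicted and ground-truth label maps."""
    scores = np.empty(K)
    for k in range(K):
        p, g = pred == k, gt == k
        denom = p.sum() + g.sum()
        # 2|P âˆ© G| / (|P| + |G|); empty-vs-empty counts as a perfect match.
        scores[k] = 2.0 * (p & g).sum() / denom if denom else 1.0
    return scores  # average Dice: scores.mean()
```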

Baselines: We compare \(\textsc {SegGini}\) with several state-of-the-art methods:

  • UZH-\(\mathrm {CNN}\) [4] and FSConv [27], for segmentation and classification using \(\mathcal {C}\)

  • Neural Image Compression (NIC) [30], Context-Aware \(\mathrm {CNN}\) (CACNN) [25], and CLAM [18], for weakly-supervised classification using \(\mathcal {IE}\)

  • HistoSegNet [9], for weakly supervised segmentation using \(\mathcal {IE}\).

These baselines are implemented based on the code and algorithms in the corresponding publications. Baselines [18, 25, 30] directly classify WSI Gleason grades and do not provide segmentation of Gleason patterns. Also, HistoSegNet [9] was trained herein with \(\mathcal {IE}\) instead of exact image labels, since accessing the exact annotations would require using \(\mathcal {C}\), which violates the weak-supervision constraints.

Implementations were done using PyTorch [19] and DGL [32] on an NVIDIA Tesla P100. The \(\textsc {SegGini}\) model consists of 6 \(\mathrm {GIN}\) layers, where the \(\mathrm {MLP}\) in \(\mathrm {GIN}\), the Graph-head, and the Node-head contain 2 layers each with \(\mathrm {PReLU}\) activation and 32-dimensional node embeddings, inspired by [40]. For graph augmentation, the nodes were randomly augmented with rotation and mirroring. A hyper-parameter search was conducted to find the optimal batch size \(\in \{4, 8, 16\}\), learning rate \(\in \{10^{-3}, 5 \times 10^{-4}, 10^{-4} \}\), dropout \(\in \{ .25, .5\}\), and \(\lambda \in \{.25, .5, .75\}\) for each setting. The methods were trained with the Adam optimizer, and the model with the best validation loss was selected. For a fair comparison, we evaluated all baselines with similar patch-level augmentations and hyper-parameter searches.
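The search can be sketched as a plain grid over the values above; train_and_validate is a hypothetical helper, not part of any released code, that would train one configuration and return its validation loss.

```python
from itertools import product

def train_and_validate(bs, lr, dropout, lam):
    """Hypothetical helper: train SegGini with one configuration and
    return its best validation loss (body omitted)."""
    raise NotImplementedError

# Grid from the text; the configuration with the best validation loss wins.
grid = product([4, 8, 16],              # batch size
               [1e-3, 5e-4, 1e-4],      # learning rate
               [.25, .5],               # dropout
               [.25, .5, .75])          # lambda (multiplex setting only)
best_loss, best_cfg = min((train_and_validate(*cfg), cfg) for cfg in grid)
```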

Table 1. Results on UZH dataset as Mean±std using complete (\(\mathcal {C}\)), inexact (\(\mathcal {IE}\)), incomplete (\(\mathcal {IC}\)), and \(\mathcal {IE+IC}\) settings. Setting-wise best scores are in bold.

Results and Discussion: Tables 1 and 2 present the segmentation and classification results of \(\textsc {SegGini}\) and the baselines, grouped by the annotations they use. In the \(\mathcal {C}\) setting, \(\textsc {SegGini}\) significantly outperforms UZH-CNN [4] on per-class and average segmentation as well as classification metrics, while reaching segmentation performance comparable to the pathologists. In the \(\mathcal {IE}\) setting, \(\textsc {SegGini}\) outperforms HistoSegNet on both segmentation and classification. Interestingly, \(\textsc {SegGini}\) also outperforms the classification-tailored baselines [18, 25, 30]. \(\textsc {SegGini}\) delivers comparable segmentation performance for inexact and complete supervision, i.e., 64% and 66% average Dice, respectively. Comparing \(\mathcal {IC}\) and \(\mathcal {IE+IC}\), we observe that \(\mathcal {IE+IC}\) produces better segmentation, especially in the low pixel-annotation regime. This improvement, however, lessens with increased pixel annotations, likely due to the homogeneous Gleason patterns in the test set, with only one or two patterns per TMA. Notably, \(\textsc {SegGini}\) in the \(\mathcal {IE}\) setting outperforms UZH-\(\mathrm {CNN}\) in the \(\mathcal {C}\) setting.

On the SICAPv2 dataset in the \(\mathcal {C}\) setting, \(\textsc {SegGini}\) outperforms FSConv for both segmentation and classification, and performs comparably to the pathologist baseline for classification. SICAPv2 is highly imbalanced, with a large fraction of benign regions. Thus, \(\textsc {SegGini}\) yields better results for the benign class and relatively poor performance for Grade5, which is rare in the dataset. In the \(\mathcal {IE}\) setting, \(\textsc {SegGini}\) significantly outperforms HistoSegNet, which is trained using tile labels set equal to the WSI labels; this indicates HistoSegNet’s inapplicability to WSIs with WSI-level supervision. For \(\mathcal {IE}\), \(\textsc {SegGini}\) performs superior to [25, 30] and comparable to [18]. When combining \(\mathcal {IE}\) and \(\mathcal {IC}\) for segmentation, the complementarity of the annotations substantially boosts performance: \(\textsc {SegGini}\) with \(\mathcal {IE+IC}\) consistently outperforms \(\mathcal {IC}\) across pixel-annotation percentages. Notably, \(\mathcal {IE+IC}\) outperforms \(\mathcal {C}\) while using only 50% of the pixels, confirming the benefit of learning from multiplex annotations. \(\textsc {SegGini}\)’s inference time for a WSI (11K \(\times \) 11K pixels at 10\(\times \)) is 14 ms, comparable to CLAM (11 ms). \(\mathrm {TG}\) construction takes 276.7 s, including 183.5 s for superpixel detection and merging, 20.5 s for patch feature extraction, and 85.7 s for RAG building. Figure 3 presents qualitative results on both datasets for various annotation settings. \(\mathcal {IE+IC}\) produces satisfactory segmentation while correcting errors in \(\mathcal {IE}\) by incorporating the scribbles. The results indicate that \(\textsc {SegGini}\) provides competitive segmentation even with inexact supervision. Thus, we can leverage readily available slide-level Gleason grades from clinical reports, potentially together with a few incomplete scribbles from pathologists, to substantially boost segmentation.

Table 2. Results on SICAPv2 as Mean±std using complete (\(\mathcal {C}\)), inexact (\(\mathcal {IE}\)), incomplete (\(\mathcal {IC}\)), and \(\mathcal {IE+IC}\) settings. Setting-wise best scores are in bold.
Fig. 3. Examples of predicted segmentation maps on the UZH and SICAPv2 datasets for various annotation settings. \(\mathcal {IC}\) is 10% and 25% for the two datasets, respectively.

4 Conclusion

We proposed a novel \(\mathrm {WSS}\) method, \(\textsc {SegGini}\), to perform semantic segmentation of histology images by leveraging complementary information from weak multiplex supervision, i.e., inexact image labels and incomplete scribbles. \(\textsc {SegGini}\) employs a graph-based classification that can directly operate on large histology images, thus utilizing local and global context for improved segmentation. \(\textsc {SegGini}\) is a generic method that can be applied to different tissues, organs, and histology tasks. We demonstrated state-of-the-art segmentation performance on two prostate cancer datasets for various annotation settings, while not compromising on classification results. Future research will focus on studying the generalizability of our method to previously unseen datasets.