
1 Introduction

Nucleus type information is essential in many pathology diagnoses [4, 9]. In many settings, the presence and proportion of certain types of nuclei are used to assess the proliferation rate, subtype, or grade of a disease [4, 13]. Traditionally, nucleus classification is treated as a supervised classification problem [3, 4, 9], and deep neural networks have achieved rather satisfactory performance. However, the superiority of supervised deep learning usually relies heavily on the availability of massive manually annotated data. As is well known, large-scale annotation of medical data, e.g., diagnostic pathology images, is expensive and time consuming, while large-scale unlabeled data are relatively easy to obtain.

To alleviate the high demand for manual annotation, semi-supervised deep learning (SSDL) has been developed to learn from a small portion of labeled data together with large-scale unlabeled data. Recently, self-ensembling (SE) based semi-supervised learning has attracted broad attention [7, 8, 11]. The intuition behind SE methods is to enforce prediction consistency for each training sample under different perturbations. Such consistency does not depend on label information and is able to extract extra semantic information from the unlabeled data. One successful SE method is temporal ensembling (TE) [5]. In TE, for each unlabeled sample, an exponential moving average (EMA) of the predictions over multiple previous training epochs is computed as the proxy target. A mean square error (MSE) between the predictions and the proxy targets is used as the consistency loss. Because the proxy targets are ensembled from the predictions of many previous epochs, they serve as stronger proxy labels that provide extra semantic information in addition to the labeled data. However, TE requires maintaining a matrix of size \(N \times C \), where N denotes the number of training samples (labeled and unlabeled) and C is the number of classes. This requirement makes the TE model heavy when learning on large datasets. To alleviate this problem, Mean Teacher (MT) [10] utilizes two models, a student and a teacher. Instead of maintaining the EMA of the proxy labels, the MT method maintains a teacher model as the EMA of the student model. In each minibatch evaluation, the output of the teacher model is used as the proxy target. Since this target is generated by an EMA model aggregated over many student models, it provides a better proxy target.

One aspect ignored by the aforementioned SE methods is the intrinsic structure of the data, namely the local and global consistency that exists in many datasets [2, 12]. Local consistency means that samples from the same class are likely to lie in the same vicinity in the feature space. Global consistency means that samples from the same global structure are likely to share the same label. To enforce local and global consistency, in this paper we propose a novel loss function computed over a graph constructed via label propagation (LP) [14]. Specifically, we utilize the LP algorithm to iteratively propagate label information from the labeled samples to the unlabeled ones based on the local structure until a globally stable state is reached, and then construct a graph based on the LP-predicted labels. Next, a Siamese loss is employed to pull data from the same class closer and push data from different classes further away, thereby enforcing both consistencies. Experiments on two nucleus classification datasets illustrate the superior performance of the proposed method over recent state-of-the-art SE methods.

2 Mean Teacher with Label Propagation

2.1 Preliminaries

Since our method is based on mean teacher (MT) [10], we first briefly introduce it in this subsection. Let \(\mathcal {X}_l=\{ x_1, x_2, \cdots , x_n \} \subset \mathbb {R}^m \) denote the labeled data and \(\mathcal {X}_u=\{ x_{n+1}, x_{n+2}, \cdots , x_N \} \subset \mathbb {R}^m \) denote the unlabeled data. The system consists of two networks, i.e., the student network and the teacher network. The parameters of the teacher network are the EMA of those of the student network, computed as \(\theta '_{\tau } = \alpha \theta '_{\tau -1}+(1-\alpha )\theta _{\tau }\), where \(\alpha \) denotes the EMA coefficient, \(\theta \) and \(\theta '\) represent the parameters of the student model and the teacher model, respectively, and \(\tau \) denotes the global training iteration. The student network is updated by the following loss:

$$\begin{aligned} Loss_{mt} = \frac{1}{n} \sum _{i=1}^n (-y_i \log f_{\theta }(x_i)) + w(\tau ) \lambda _{EMA} \mathbb {E}_{x,\eta , \eta '} [\Vert f_{\theta '}(x, \eta ') - f_{\theta }(x, \eta ) \Vert ], \end{aligned}$$
(1)

where \(\lambda _{EMA}\) is the coefficient controlling the strength of consistency between predictions of the same sample under different perturbations represented by \(\eta \) and \(\eta '\). \(w(\tau )\) is a ramp function of the global iterations \(\tau \). The first term is the cross-entropy loss for the labeled data and the second term enforces the consistency between the predictions of the student network \(f_{\theta }(x, \eta )\) and the teacher network \(f_{\theta '}(x, \eta ')\). The consistency term is computed on all the data.
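For concreteness, a minimal PyTorch-style sketch of the EMA update and the loss in Eq. (1) is given below. The function names are ours, and the use of MSE over softmax outputs as the consistency distance follows the common MT implementation rather than any code released with [10].

```python
import torch
import torch.nn.functional as F

def update_teacher(student, teacher, alpha=0.99):
    # theta'_tau = alpha * theta'_{tau-1} + (1 - alpha) * theta_tau,
    # applied after every optimizer step on the student.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.data.mul_(alpha).add_(s_param.data, alpha=1 - alpha)

def mt_loss(student, teacher, x_labeled, y_labeled, x_all, w_tau, lambda_ema):
    # Supervised cross-entropy on the labeled part of the minibatch.
    ce = F.cross_entropy(student(x_labeled), y_labeled)
    # Consistency between student and teacher predictions on all samples;
    # input noise / dropout provide the perturbations eta and eta'.
    with torch.no_grad():
        teacher_pred = torch.softmax(teacher(x_all), dim=1)
    student_pred = torch.softmax(student(x_all), dim=1)
    consistency = F.mse_loss(student_pred, teacher_pred)
    return ce + w_tau * lambda_ema * consistency
```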

Fig. 1. Each minibatch consists of both labeled and unlabeled samples. The LP-predicted labels and the ground truth labels are used to construct a graph capturing the local and global structure of the data. A Siamese loss is computed based on the graph. The student network is updated by a hybrid loss consisting of the classification loss, the consistency loss, and the Siamese loss.

2.2 Local and Global Consistency Regularized Mean Teacher

As mentioned before, the MT method ignores the connections between samples and thus fails to extract additional semantic information from the unlabeled data. In the proposed method, for each minibatch, LP is first conducted on intermediate-level features from the teacher network, because the teacher network is an ensemble model that is expected to generate better feature embeddings. Then a graph is constructed using the ground truth labels and the LP-predicted labels. Next, a Siamese loss is calculated based on the graph using the features generated by the student network. Finally, a hybrid loss, comprising the loss in Eq. (1) and the Siamese loss, is used to update the student network. An overview of the proposed system is depicted in Fig. 1.

Label Propagation: Label propagation [14] is a transductive semi-supervised learning algorithm. It propagates label information from the labeled data to the unlabeled data based on the affinity matrix of the data. The basic idea is that data points close to each other are more likely to share the same label. Therefore, the LP procedure computes the label of an unlabeled sample as the weighted sum of the labels of its neighbors. Through an iterative procedure, labels are propagated from the labeled data to their neighbors, and to the neighbors of those neighbors. Finally, the unlabeled data are assigned labels that respect the global structure of the data. The LP algorithm is proven to converge; more details of the proof can be found in [14].
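As an illustration, the propagation step can be sketched with scikit-learn's LabelSpreading estimator, which implements a transductive LP algorithm of this kind; the choice of estimator and the kernel settings below are our assumptions, not necessarily those used by the authors.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def propagate_labels(features, labels):
    """Propagate labels from labeled to unlabeled samples in a minibatch.

    features : (N, d) array of intermediate teacher-network features
               (detached from the computation graph).
    labels   : (N,) array with class indices for labeled samples and -1
               for unlabeled samples (scikit-learn's convention).
    """
    lp = LabelSpreading(kernel='rbf', gamma=0.25, alpha=0.2, max_iter=30)
    lp.fit(features, labels)
    return lp.transduction_  # predicted labels for every sample in the batch
```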

Graph Based Clustering Loss: With the LP-predicted labels for the unlabeled data, the pairwise connection information between the data points is known. With this information, a graph can be built by:

$$\begin{aligned} A_{ij} = {\left\{ \begin{array}{ll} 1, &{} \text {if } y_i = y_j,\\ 0, &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(2)

where \(y_i\) denotes the LP-predicted label for an unlabeled sample \((n < i \le N)\) and the ground truth label for a labeled sample \((i \le n)\); edges are formed only between labeled pairs \((i, j \le n)\) and labeled-unlabeled pairs \((j \le n < i \le N)\). To enforce the local and global consistencies, we propose to use the contrastive Siamese loss [1] to pull samples from the same class closer and push those from different classes further away:

$$\begin{aligned} L_s = {\left\{ \begin{array}{ll} \Vert z_i - z_j \Vert ^2, &{} \text {if } A_{ij} = 1,\\ \max (0, m - \Vert z_i - z_j \Vert ^2), &{} \text {if } A_{ij} = 0, \end{array}\right. } \end{aligned}$$
(3)

where \(z_i\) represents the feature vector from the intermediate layers of the student network and \(m\) is the margin hyperparameter. The final proposed loss function is:

$$\begin{aligned} L_{total} = Loss_{mt} + w(\tau ) (\lambda _{g1} \sum _{x_i,x_j \in \mathcal {X}_l} L_{s1} + \lambda _{g2} \sum _{x_i \in \mathcal {X}_l, x_j \in \mathcal {X}_u} L_{s2}), \end{aligned}$$
(4)

where \(\lambda _{g1}\) denotes the weight of the Siamese loss computed on the labeled samples, and \(\lambda _{g2}\) denotes the weight of the Siamese loss computed on labeled-unlabeled pairs. Since LP does not change the labels of the labeled samples, the Siamese loss \(L_{s1}\) ensures that there is always some correct information for learning. Note that we do not compute the Siamese loss between unlabeled samples, because the LP-predicted labels are very noisy and including such pairs could harm training.
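The sketch below is our own PyTorch rendering (not the authors' code) of how the graph of Eq. (2) and the two Siamese terms in Eq. (4) can be computed from minibatch features z, labels y (ground truth for labeled samples, LP-predicted otherwise), and a boolean mask of labeled samples. Pair losses are averaged rather than summed, which only rescales \(\lambda_{g1}\) and \(\lambda_{g2}\).

```python
import torch

def siamese_loss(z_i, z_j, same_class, margin=1.0):
    # Contrastive Siamese loss (Eq. 3): pull same-class pairs together and
    # push different-class pairs at least `margin` apart in feature space.
    d2 = torch.sum((z_i - z_j) ** 2, dim=1)
    pos = same_class.float() * d2
    neg = (1.0 - same_class.float()) * torch.clamp(margin - d2, min=0.0)
    return (pos + neg).mean()

def graph_losses(z, y, labeled_mask, margin=1.0):
    # The adjacency of Eq. (2) is implicit in the pairwise label comparison.
    idx_l = labeled_mask.nonzero(as_tuple=True)[0]
    idx_u = (~labeled_mask).nonzero(as_tuple=True)[0]
    # L_s1: pairs drawn from the labeled samples only.
    i, j = torch.meshgrid(idx_l, idx_l, indexing='ij')
    i, j = i.reshape(-1), j.reshape(-1)
    l_s1 = siamese_loss(z[i], z[j], y[i] == y[j], margin)
    # L_s2: pairs between labeled and unlabeled samples
    # (unlabeled-unlabeled pairs are intentionally excluded).
    i, j = torch.meshgrid(idx_l, idx_u, indexing='ij')
    i, j = i.reshape(-1), j.reshape(-1)
    l_s2 = siamese_loss(z[i], z[j], y[i] == y[j], margin)
    return l_s1, l_s2
```

The two returned terms are then weighted by \(\lambda_{g1}\), \(\lambda_{g2}\), and \(w(\tau)\), and added to \(Loss_{mt}\) as in Eq. (4).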

3 Experiments

To evaluate our method, we conduct experiments on two datasets: the MoNuseg dataset [9] and our own Ki-67 nucleus dataset. The MoNuseg dataset contains four types of nuclei (Epithelial, Inflammatory, Fibroblast, and Miscellaneous). The Ki-67 dataset also contains four types: immunopositive tumor, immunopositive non-tumor, immunonegative tumor, and immunonegative non-tumor nuclei. The MoNuseg dataset contains 22462 nuclei and the Ki-67 dataset contains 17516 nuclei. For both datasets, \(80\%\) of the nucleus images are used for training (with \(20\%\) of the training data held out for validation) and the remaining \(20\%\) are used for testing. A few samples of each type of nucleus are shown in Fig. 2(a).

We compare our method against two state-of-the-art SSDL methods, i.e., TE [5] and MT [10], and a baseline fully supervised method trained using the labeled data only. For each comparison, we train the different methods using only \(x\%\) of the training data as labeled data (\(x \in \{5, 10, 25, 50\}\) for the MoNuseg dataset and \(x \in \{1, 5, 10\}\) for the Ki-67 dataset) and the rest as unlabeled data. In the fully supervised setting, the same network is trained using the labeled data only. Additionally, since the MoNuseg dataset is publicly available, we also report two results from [9], i.e., CNN-SSPP and CNN-NEP, both of which are fully supervised methods. In all comparisons, the weighted average \(F_1\) score is used as the evaluation metric. For the semi-supervised settings, we report the average \(F_1\) scores and their standard deviations over 5 runs on the testing data. In each of the 5 runs, a different set of labeled data is randomly selected.
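For reference, the weighted average \(F_1\) can be computed with scikit-learn; the toy labels below are purely illustrative.

```python
from sklearn.metrics import f1_score

# Toy ground-truth and predicted class indices for a 4-class problem.
y_true = [0, 1, 2, 3, 1, 0, 2, 3]
y_pred = [0, 1, 2, 2, 1, 0, 2, 3]
print(f1_score(y_true, y_pred, average='weighted'))
```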

Fig. 2. (a) Some sample nuclei from the two datasets. (b) The \(F_1\) scores of each class in the MoNuseg data obtained using 25% of the training data.

3.1 Implementation Details

Network Architecture. In this paper, we adopt a network similar to the one used in [10]. The differences are that the kernel sizes of the last two convolutional layers are set to 3, and that the input noise layer, the ZCA layer, and mean-only batch normalization are omitted. The advantage of this choice is that every component of our network can be implemented using standard PyTorch functions and the scikit-learn package.

The features used in label propagation and the Siamese loss are extracted from an intermediate layer (Fig. 1); this design choice is empirical. LP is conducted for each minibatch to build a connection graph, and the graph-based Siamese loss is computed as the two terms in Eq. (4). Specifically, the summation on \(L_{s1}\) is computed over 30 labeled samples, and the summation on \(L_{s2}\) is computed between these 30 labeled samples and 30 randomly selected unlabeled samples. The coefficients for the two terms are shown in Table 1. The time complexity of LP is \(\mathcal {O}(kCN^2)\), where k denotes the number of iterations, C the number of classes, and N the minibatch size. With this overhead, our model can still be trained within 6 h on a GTX 1080 Ti GPU.

Hyperparameter Selection. We mostly follow the parameter settings used in the MT method [10]. The learning rate and the ramp function \(w(\tau )\) are ramped up and down over the 150000 global steps: they are ramped up during the first 40000 steps, kept constant for the following 85000 steps, and finally decreased to 0 during the last 25000 steps. We use \(w(\tau ) = e^{-5(1-\tau /150000)^2}\) as the ramp-up function and \(w(\tau ) = e^{-12.5(\tau /150000)^2}\) as the ramp-down function. The other parameter settings of our method are shown in Table 1.
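A sketch of one way such a schedule can be implemented is given below. As in the reference Mean Teacher implementation, we normalize the exponent by the length of the respective ramp phase so that the weight reaches 1 after 40000 steps and decays to effectively 0 over the last 25000 steps; this normalization is our reading of the formulas above.

```python
import numpy as np

TOTAL_STEPS = 150000
RAMP_UP_STEPS = 40000
RAMP_DOWN_STEPS = 25000

def ramp_up(tau):
    # Sigmoid-shaped ramp-up, reaching 1 at the end of the ramp-up phase.
    t = np.clip(tau / RAMP_UP_STEPS, 0.0, 1.0)
    return float(np.exp(-5.0 * (1.0 - t) ** 2))

def ramp_down(tau):
    # Sigmoid-shaped ramp-down over the last RAMP_DOWN_STEPS steps.
    start = TOTAL_STEPS - RAMP_DOWN_STEPS
    if tau < start:
        return 1.0
    t = (tau - start) / RAMP_DOWN_STEPS
    return float(np.exp(-12.5 * t ** 2))

def schedule(tau, base_lr):
    # Both the consistency/Siamese weight w(tau) and the learning rate are
    # ramped up and then ramped down over the 150000 global steps.
    factor = ramp_up(tau) * ramp_down(tau)
    return factor, base_lr * factor
```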

Table 1. Hyperparameter selection.

3.2 Results and Analysis

Tables 2 and 3 show that our method outperforms the state of the art, especially when using less labeled data. For the MoNuseg dataset (Table 2), our method achieves around 2% higher \(F_1\) scores than the MT and TE methods when using 5% and 10% of the training data. As the amount of labeled data increases, the performance of all semi-supervised methods converges. Compared with the baseline fully supervised method using labeled data only, our performance is higher by a large margin. Compared with the results reported in [9], our method outperforms CNN-SSPP using only 5% labeled data and achieves performance close to CNN-NEP using only 25% labeled data. It is worth noting that our method and CNN-SSPP take the nucleus patch as the sole input, while CNN-NEP also takes into account the contextual information around the nucleus. This means CNN-NEP actually uses more labeled data. Moreover, using the contextual information around the nucleus may not generalize to all nucleus classification problems. CNN-SSPP and CNN-NEP are both fully supervised methods; they are listed in the 50% labels column of Table 2 because they are based on two-fold cross-validation [9]. Since the MoNuseg dataset is imbalanced, we show a comparison of the \(F_1\) scores for each class in Fig. 2(b). Finally, to demonstrate the effect of the graph-based clustering loss, we show the feature embedding of the MoNuseg testing data in Fig. 3.

Fig. 3. Embeddings of the MoNuseg testing data projected to 2D space using UMAP [6]. (a) The feature embedding obtained by MT. (b) The embedding obtained by our method.

For the Ki-67 dataset, we observe similar behavior: our method outperforms MT, TE, and the fully supervised method. Ablation studies are designed to show the effect of the proposed graph-based clustering loss, which consists of two parts: (i) \(L_{s1}\), computed on the labeled data only; and (ii) \(L_{s2}\), computed between the unlabeled and the labeled data. We train our model with one of the two losses removed and report the performance in Table 3. The performance drops if either one is removed, which shows the advantage of learning from a graph constructed on both labeled and unlabeled data.

Table 2. \(F_1\) ± std over 5 runs on MoNuseg dataset [9].
Table 3. \(F_1\) ± std over 5 runs on Ki-67 dataset.

4 Conclusion

In this paper, we presented a novel semi-supervised deep learning method for nucleus classification. The proposed method is a self-ensembling based deep learning method with additional regularization from the local and global consistency criteria. These consistencies enable the framework to learn a better distance metric, such that the resultant model outperforms state-of-the-art self-ensembling methods on two nucleus classification datasets. The proposed approach is general and can be easily adapted to many other image classification tasks.