1 Introduction

Image segmentation plays an important role in extracting quantitative imaging markers of disease for improved medical diagnosis and treatment. Convolutional neural networks (CNNs) have shown promise for medical image segmentation [1]; however, they require large training sets to generalize well. In medical applications, labels are often available only for a limited number of subjects, typically drawn from a healthy group within a specific age range. Models trained on such a population will not perform well on subjects from a different age group (such as newborns or children), subjects imaged on a different scanner, or subjects with a specific disease. Generalizing these models requires annotating more images, but medical annotation is costly, which makes active learning (AL) an appealing way to build generalizable models with the smallest number of additional annotations. Generally speaking, AL aims to select the most informative queries to be labeled from a pool of unlabeled samples.

Among AL algorithms used for medical image segmentation, uncertainty sampling has been one of the most popular [2, 3]; it queries the most uncertain samples to be labeled. It has recently been used with neural networks, where uncertainty was measured based on sample margins [4] or bootstrapping [5]. For the same purpose, Wang et al. [6] used the entropy function but mixed the queried labels with weak labels. More sophisticated objectives, such as Fisher information (FI), have theoretically been shown to be beneficial for active learning [7,8,9]. FI measures the amount of information carried by the observations about the underlying unknown parameter. An earlier work [10] successfully applied FI to medical image segmentation using logistic regression. However, FI-based objective functions for AL have not previously been applied to CNN models, mainly because the significantly larger parameter space of deep learning models makes evaluating FI computationally intractable.

In this paper, we propose a modified version of FI-based AL for image segmentation with CNNs. The modification makes the queries even more informative by encouraging them to be as diverse as possible. We observe that using the selected queries to fine-tune only the last few layers of a CNN can effectively improve the initial model performance, so there is no need for blending with weak labels. Furthermore, we leverage the very efficient backpropagation methods for gradient computation in CNN models to make the evaluation of FI tractable. We formulate the proposed diversified FI-based AL for CNN-based patch-wise brain extraction and compare it with two baselines, random sampling and entropy-based querying (uncertainty sampling), in two scenarios: semi-automatic segmentation and universal active learning. Our results show that the proposed method significantly outperforms random querying and can effectively improve the performance of a pre-trained model by querying a very small percentage (less than 0.05%) of image voxels. Finally, we show that the FI-based method outperforms the entropy-based approach when active querying is used for transfer learning.

2 Methods

We explain our AL method in the context of a single querying iteration, when a parameter estimate \(\hat{{{\mathrm{\varvec{\theta }}}}}\) is already available from an initial labeled data set. We assume that the CNN model is capable of providing us with the class posterior probability \(\mathbb {P}(y|\hat{{{\mathrm{\varvec{\theta }}}}},{{\mathrm{\mathbf {x}}}})\). In each iteration, selected queries will be labeled by the expert and the model will be updated. This process repeats using the updated model. Throughout this section, \({{\mathrm{\mathcal {U}}}}=\{{{\mathrm{\mathbf {x}}}}_1,...,{{\mathrm{\mathbf {x}}}}_n\}\) denotes the unlabeled pool of samples and \(Q\subseteq {{\mathrm{\mathcal {U}}}}\) is the (candidate) query set. The goal in a querying iteration is to generate (no more than) \(k>0\) most informative queries.

2.1 FI-Based AL

Fisher information (FI), defined as \(\mathbb {E}_{{{\mathrm{\mathbf {x}}}},y}\left[ \nabla _{{{\mathrm{\varvec{\theta }}}}}\log \mathbb {P}(y|{{\mathrm{\mathbf {x}}}},{{\mathrm{\varvec{\theta }}}}_0)\nabla ^\top _{{{\mathrm{\varvec{\theta }}}}}\log \mathbb {P}(y|{{\mathrm{\mathbf {x}}}},{{\mathrm{\varvec{\theta }}}}_0)\right] \), measures the amount of information that an observation carries about the true model parameter \({{\mathrm{\varvec{\theta }}}}_0\in \mathbb {R}^{\tau }\). The trace of the inverse FI serves as a useful active learning objective [8, 9], which is minimized with respect to a query distribution \(\mathbf {q}\) defined over the pool \({{\mathrm{\mathcal {U}}}}\) (hence \(q_i\) is the probability of querying \({{\mathrm{\mathbf {x}}}}_i\in {{\mathrm{\mathcal {U}}}}\)). Different approximations can be introduced for tractability [7, 10]. Here, we follow the algorithm in [11] (originally used for logistic regression), which aims to solve

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\mathbf {q}\in [0,1]^n} \text {tr}\left[ {{\mathrm{\mathbf {I}}}}_\mathbf {q}({{\mathrm{\varvec{\theta }}}}_0)^{-1}\right] . \end{aligned}$$
(1)

This optimization has a non-linear objective, but it can be reformulated in the form of a semi-definite programming (SDP) problem [12].
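To build intuition for objective (1), the following toy sketch (entirely our own illustration, not the paper's code) evaluates \(\text {tr}[{{\mathrm{\mathbf {I}}}}_\mathbf {q}^{-1}]\) for candidate query PMFs over a pool whose conditional FI matrices are rank-one outer products of gradient vectors; a small ridge term keeps the pooled FI invertible. The function and variable names here are our own.

```python
def pooled_fi_trace_2x2(A_list, q, ridge=1e-6):
    """tr[I_q^{-1}] for 2x2 conditional FI matrices (tau = 2 toy case).

    A_list : per-sample conditional FI matrices, each [[a, b], [b, c]]
    q      : querying probabilities over the pool
    """
    # Pooled FI: I_q = sum_i q_i A_i (+ ridge on the diagonal for invertibility)
    a = sum(qi * Ai[0][0] for qi, Ai in zip(q, A_list)) + ridge
    b = sum(qi * Ai[0][1] for qi, Ai in zip(q, A_list))
    c = sum(qi * Ai[1][1] for qi, Ai in zip(q, A_list)) + ridge
    det = a * c - b * b
    # Trace of the inverse of a symmetric 2x2 matrix is (a + c) / det
    return (a + c) / det

# Rank-one FI matrices A_i = g_i g_i^T built from toy gradient vectors g_i
grads = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
A_list = [[[gx * gx, gx * gy], [gx * gy, gy * gy]] for gx, gy in grads]

# Spreading query mass over complementary gradient directions lowers the
# objective relative to putting all mass on a single sample
spread = pooled_fi_trace_2x2(A_list, [0.5, 0.5, 0.0])
peaked = pooled_fi_trace_2x2(A_list, [1.0, 0.0, 0.0])
print(spread < peaked)  # True
```

The example shows why the A-optimality criterion naturally favors query sets that cover different directions of the parameter space: a PMF concentrated on one sample leaves the pooled FI near-singular, blowing up the trace of its inverse.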

2.2 Diversified FI-Based AL

Although (1) takes into account the interaction between different samples, it is not obvious how much diversity it induces within Q. To further encourage a well-spread probability mass function (PMF) and more diverse queries, we include an additional covariance-dependent term \(-\lambda \text {tr}\big [\text {Cov}_{\mathbf {q}}[{{\mathrm{\mathbf {x}}}}]\big ]\) in the objective, where \(\lambda \) is a positive mixing coefficient. Unfortunately, adding this term to the objective prevents us from forming a linear SDP. To retain tractability, we restrict ourselves to zero-mean PMFs, i.e., \(\mathbb {E}_{\mathbf {q}}[{{\mathrm{\mathbf {x}}}}]=\mathbf {0}\). This constraint makes the covariance term linear in the \(q_i\)’s:

$$\begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\mathbf {q}\in [0,1]^n}&\text {tr}\left[ {{\mathrm{\mathbf {I}}}}_\mathbf {q}({{\mathrm{\varvec{\theta }}}}_0)^{-1}\right] -\lambda \sum _{i=1}^nq_i{{\mathrm{\mathbf {x}}}}_i^\top {{\mathrm{\mathbf {x}}}}_i \quad \text {s.t.}\quad \sum _{i=1}^nq_i{{\mathrm{\mathbf {x}}}}_i = \mathbf {0}. \end{aligned}$$
(2)

Following an approach similar to [11], we can get the following linear SDP:

$$ \begin{aligned} \mathop {\mathrm {arg\,min}}\limits _{\mathbf {q}\in [0,1]^n,\mathbf {t}\in \mathbb {R}^\tau }&t_1 + ... + t_\tau - \lambda \sum _{i=1}^n q_i{{\mathrm{\mathbf {x}}}}_i^\top {{\mathrm{\mathbf {x}}}}_i \nonumber \\ \text {s.t.} \quad&\sum _{{{\mathrm{\mathbf {x}}}}_i\in {{\mathrm{\mathcal {U}}}}} q_i{{\mathrm{\mathbf {x}}}}_i = \varvec{0} \quad \& \quad \begin{bmatrix} \sum _{i}q_i{{\mathrm{\mathbf {A}}}}_i \quad&{{\mathrm{\mathbf {e}}}}_j \\ {{\mathrm{\mathbf {e}}}}_j^\top&t_j \end{bmatrix}\succeq 0,\, j=1,...,\tau . \end{aligned}$$
(3)

where \(t_1,\ldots ,t_\tau \) are auxiliary variables, \({{\mathrm{\mathbf {e}}}}_j\) is the j-th canonical vector, and \(\mathbf {A}_i\in \mathbb {R}^{\tau \times \tau }\) is the conditional FI of \({{\mathrm{\mathbf {x}}}}_i\), defined as

$$\begin{aligned} {{\mathrm{\mathbf {A}}}}_i \, := \, \sum _{y=1}^c\mathbb {P}(y|{{\mathrm{\mathbf {x}}}}_i,{{\mathrm{\varvec{\theta }}}}_0)\nabla _{{{\mathrm{\varvec{\theta }}}}}\log \mathbb {P}(y|{{\mathrm{\mathbf {x}}}}_i,{{\mathrm{\varvec{\theta }}}}_0)\nabla _{{{\mathrm{\varvec{\theta }}}}}^\top \log \mathbb {P}(y|{{\mathrm{\mathbf {x}}}}_i,{{\mathrm{\varvec{\theta }}}}_0) \end{aligned}$$
(4)

Since \({{\mathrm{\varvec{\theta }}}}_0\) is not known, it is replaced by the available estimate \(\hat{{{\mathrm{\varvec{\theta }}}}}\). Finally, solving (2) can be slow when n (the pool size) and \(\tau \) (the number of parameters) are very large, which is usually the case for CNN-based image segmentation. To speed up the computation, we reduce both values by (a) downsampling \({{\mathrm{\mathcal {U}}}}\) to keep only the \(\beta \) most uncertain samples [11, 13], and (b) shrinking the parameter space by representing each CNN layer with the average of its parameters. Once the querying PMF \(\mathbf {q}\) is obtained, k samples are drawn from it and the distinct samples are used as the queries.
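The querying pipeline above (downsample the pool by uncertainty, solve for the PMF \(\mathbf {q}\), then draw k samples and keep the distinct ones) can be sketched as follows. This is our own minimal sketch: `q_solver` is a placeholder standing in for the SDP solver of (3), and all names are ours.

```python
import random

def select_queries(pool_scores, q_solver, k, beta, seed=0):
    """Sketch of one querying round: downsample, solve for q, sample queries.

    pool_scores : list of (sample_id, uncertainty) pairs for the pool U
    q_solver    : callable mapping the retained id list to a PMF (list of
                  probabilities summing to 1); placeholder for the SDP
    """
    # (a) Downsample the pool: keep only the beta most uncertain samples
    ranked = sorted(pool_scores, key=lambda s: s[1], reverse=True)[:beta]
    ids = [sid for sid, _ in ranked]
    # Solve for the querying PMF over the retained samples
    q = q_solver(ids)
    # Draw k samples from q; the distinct draws become the queries
    rng = random.Random(seed)
    draws = rng.choices(ids, weights=q, k=k)
    return sorted(set(draws))

# Toy run with a uniform stand-in PMF over the retained samples
scores = [(i, random.Random(i).random()) for i in range(1000)]
uniform = lambda ids: [1.0 / len(ids)] * len(ids)
queries = select_queries(scores, uniform, k=50, beta=200)
print(len(queries) <= 50)  # True: duplicates are merged, so at most k queries
```

Because sampling is with replacement, the number of distinct queries per round can fall below k, which matches the "(no more than) k" phrasing in Sect. 2.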

3 Experimental Results

We applied the proposed method and the baselines to CNN-based patch-wise brain extraction. We use the tag random for random querying, entropy for entropy-based querying, and Fisher for FI-based querying with \(\lambda =0.25\) and \(\beta =200\). In entropy, we used Shannon entropy as the uncertainty measure. Our data sets contain T1-weighted MRI images of two groups of subjects: (a) 66 adolescents aged 10 to 15, and (b) 25 newborns from the Developing Human Connectome Project [14]. The CNN model used in our experiments is shown in Fig. 1. Inputs are axial patches of size \(25\times 25\times 1\). The feature vectors \({{\mathrm{\mathbf {x}}}}_i\) in (3) are extracted from the output of the second FC layer.
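The Shannon entropy uncertainty measure used by the entropy baseline is simple to state in code; a minimal sketch (our own, with hypothetical names) over a per-patch class posterior:

```python
import math

def shannon_entropy(posterior):
    """Per-sample uncertainty: Shannon entropy of the class posterior."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)

# A near-certain patch scores low; a maximally ambiguous one scores high
confident = shannon_entropy([0.99, 0.01])
ambiguous = shannon_entropy([0.5, 0.5])
print(confident < ambiguous)   # True
print(round(ambiguous, 4))     # 0.6931 (= ln 2, the two-class maximum)
```

Uncertainty sampling then simply queries the samples with the largest entropy, whereas the FI-based method additionally accounts for interactions between queries.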

Fig. 1. Architecture of the CNN model used for brain extraction.

We first trained an initial model using randomly selected patches from three adolescent subjects and used it to initialize the AL experiments, with k set to 50. Each querying iteration started with an empty labeled data set \({{\mathrm{\mathcal {L}}}}_0\) and an initial model \(\mathcal {M}_0\). At iteration i, \(\mathcal {M}_{i-1}\) was used to score samples and select the queries. Labels of the queries were added to \({{\mathrm{\mathcal {L}}}}_{i-1}\) to form \({{\mathrm{\mathcal {L}}}}_i\), which was used to update \(\mathcal {M}_{i-1}\) by fine-tuning only the FC layers. Accordingly, when computing the conditional FIs in (4), we only computed gradients for the FC layers. Next, we discuss two general scenarios for evaluating the performance of AL methods.
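The iteration scheme just described can be summarized in a short skeleton. This is a structural sketch only: `StubModel` and its methods are placeholders we introduce for illustration, not the paper's implementation.

```python
class StubModel:
    """Minimal stand-in for the CNN; only the AL-facing interface matters."""
    def __init__(self):
        self.updates = 0
    def select_queries(self, pool, k):
        return pool[:k]               # placeholder for AL scoring/querying
    def finetune_fc_layers(self, labeled):
        self.updates += 1             # placeholder for FC-layer fine-tuning

def active_learning_run(model, pool, oracle, n_iters, k=50):
    """score with M_{i-1} -> query -> expert labels -> fine-tune to M_i."""
    labeled = []                                      # L_0 starts empty
    for _ in range(n_iters):
        queries = model.select_queries(pool, k)       # uses current model
        labeled += [(x, oracle(x)) for x in queries]  # expert labels queries
        model.finetune_fc_layers(labeled)             # L_i updates the model
        pool = [x for x in pool if x not in set(queries)]
    return model, labeled

model, labeled = active_learning_run(StubModel(), list(range(500)),
                                     oracle=lambda x: x % 2, n_iters=3)
print(model.updates == 3 and len(labeled) == 150)  # True
```

The key design choice mirrored here is that each iteration fine-tunes on the cumulative labeled set \({{\mathrm{\mathcal {L}}}}_i\), not only the newest queries, and queried samples leave the pool.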

3.1 Active Semi-automatic Segmentation

Here, the goal is to refine the initial pre-trained model to segment a particular subject’s brain by annotating the smallest number of additional voxels from the same subject. For computational simplicity, we used grid subsampling of voxels with a fixed grid spacing of 5, resulting in pools of unlabeled samples of size \(\sim \)200,000 for adolescent and \(\sim \)350,000 for newborn subjects. We evaluated the resulting segmentation accuracy for the specific subject after each AL iteration over the grid voxels. We also report the initial and final segmentations over all voxels after post-processing the segmentations with a CRF (for newborns), Gaussian smoothing (with standard deviation 2), morphological closing (with radius 2), and 3D connected component analysis.
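Grid subsampling with a fixed spacing is straightforward; a minimal sketch (our own, and without the brain masking a real pipeline would also apply) illustrates how the spacing shrinks the pool:

```python
def grid_subsample(shape, spacing=5):
    """Voxel indices on a regular 3D grid with the given spacing."""
    xs, ys, zs = shape
    return [(x, y, z)
            for x in range(0, xs, spacing)
            for y in range(0, ys, spacing)
            for z in range(0, zs, spacing)]

# With spacing 5, each axis keeps ceil(n / 5) indices, so the pool shrinks
# by roughly a factor of 125 relative to the full voxel grid
pool = grid_subsample((256, 256, 256))
print(len(pool) == 52 ** 3)  # True: 52 indices per axis (0, 5, ..., 255)
```

Restricting evaluation and querying to these grid voxels is what keeps each AL iteration tractable at full-brain scale.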

Table 1 shows the mean and standard deviation of F1 scores in different querying iterations over 25 newborns and 63 adolescents (after excluding the three images used to train \(\mathcal {M}_0\)). The table shows that Fisher and entropy raised performance significantly higher than random, and increased the initial F1 score by labeling less than 0.05% of total voxels. In contrast, random decreased the average score in the early iterations, which illustrates the potential negative effect of poor query selection. The table shows only a slight difference between Fisher and entropy when considering all images collectively. However, we observed that Fisher actually outperformed entropy in more than 60% of the newborn subjects (16 out of 25), while performing almost equally on the others. Figure 2(a) shows box plots of the difference between the F1 scores of Fisher and entropy for these two groups of subjects, where the white boxes lie mostly on the positive side.

Table 1. F1 scores of the models obtained from querying iterations of different AL algorithms. The scores of intermediate querying iterations are based on grid samples, whereas the initial and final scores are reported based on full segmentation.

The improvements in F1 scores are shown for two selected subjects, one from each group, in Figs. 2(b) and (c). Furthermore, to visualize how differences in F1 scores may be reflected in the segmentations, Fig. 3 shows the segmentation of a slice of the subject associated with Fig. 2(b). Observe that the model pre-trained on adolescent subjects falsely classified skull as brain, since the brains of adolescent and newborn subjects look very different in T1-weighted contrast. After AL querying, all methods could better distinguish these regions, but random and entropy produce many more false negatives than Fisher.

Fig. 2. F1 scores reported separately for two groups of newborn subjects, when Fisher\(\,>\,\)entropy and Fisher\(\,\approx \,\)entropy. The box plots consider all subjects in each group, whereas the F1 curves in (b) and (c) are for one sample subject from each group.

Fig. 3. Segmentation of a slice using \(\mathcal {M}_0\) and the models obtained in active semi-automatic segmentation of the newborn whose F1 curves are shown in Fig. 2(b). Green boundaries show the ground-truth segmentation and red regions are the resulting brain extraction. (Color figure online)

3.2 Universal Active Learning

In this section, we applied FI-based AL sequentially to a subset of new subjects to further improve the initial CNN model, aiming for a universal model that can segment all other subjects in the same data set. The goal was to show that FI-based querying yields a more generalizable model. We ran a sequence of FI-based AL over 11 subjects in each data set, such that the initial model of the querying iterations over one subject was the final model obtained from the previous subject. The pre-trained model \(\mathcal {M}_0\) described above was used to initialize the AL algorithm for the first image. For each subject, we continued running the querying iterations with \(k=50\) until 1,500 queries were labeled. The resulting universal model was then tested on the remaining unused subjects in the data set. Note that for the newborn data set this is a transfer learning scenario, where an initial model pre-trained on the adolescent data set was updated using the proposed AL approach to achieve improved performance on the newborn data set. Results on the test subjects, reported in Fig. 4, show that the initial model is significantly improved after labeling a very small portion (less than 0.02%) of the voxels involved in the querying.

Fig. 4. Statistics of F1 scores of the universal models resulting from a sequence of FI-based querying over 11 images, and of the initial model \(\mathcal {M}_0\), over the test images of adolescent and newborn subjects. The box plots and histograms show that, except for a few adolescent outliers, the F1 scores are significantly increased by the proposed FI-based AL.

4 Conclusion

In this paper, we presented active learning (AL) algorithms based on Fisher information (FI) for patch-wise image segmentation using CNNs. In these new algorithms, a diversifying term was added to the FI-based querying objective, and efficient FI evaluation was achieved using gradient computations from backpropagation on the CNN model. In the context of brain extraction, the proposed AL algorithm significantly outperformed random querying. We also observed that FI worked better than entropy in transfer learning, where we actively fine-tuned a pre-trained model to adapt it to segment images from a patient group with different characteristics (age, pathology, scanner) than the source data set. FI-based querying was also successfully applied to create universal CNN models for both the source (adolescent) and target (newborn) data sets, labeling minimal new samples while achieving a large improvement in performance.