
1 Introduction

The max-tree structure [16, 21] enables efficient attribute filtering [7] and has been used in many image processing applications. The resulting operators are called connected [22] since they neither create new contours nor modify their position. In [3], Farfan et al. suggested that max-tree attributes could be used to feed a deep convolutional neural network (CNN) in order to improve the results of detection and segmentation tasks. Following and generalizing this approach, the aim of this paper is to provide a reproducible framework for performing various experiments involving max-trees and CNNs in the context of semantic segmentation of cellular FIB-SEM images.

Our contributions are twofold. In a first approach, input images are preprocessed using various attribute filters [12] and then concatenated as additional inputs of a CNN. In a second approach, maps of attributes are computed from the max-tree and then added as CNN inputs, following the approach of Farfan et al.

Finally, our work aims to be practical. For this purpose, all the methods we propose can be run on a high-end workstation and do not require large GPU/TPU clusters. Moreover, our source code, datasets (original images and annotations) and documentation are publicly available, allowing anyone to reproduce the results and to reuse the code for their own needs.

2 State of the Art

To address the segmentation of cellular electron microscopy images, state-of-the-art methods are currently based on CNNs [2, 5, 9, 12, 18, 19, 24], and the U-Net architecture remains the most widely used. However, despite the good accuracy these methods can reach, the resulting segmentations may still suffer from various imperfections. In particular, thin and elongated objects such as the endoplasmic reticulum can be disconnected and some parts may be distorted [12]. These effects may stem from the fixed and narrow context window in the first layers of the CNN, which prevents the network from capturing sufficient global information.

To overcome this, Farfan et al. [3] have proposed to enrich a CNN with attributes computed from the max-tree, capturing at the pixel level information that may be non-local.

In the remainder of this paper, we explore various strategies to incorporate max-tree attributes into a CNN, with the aim of improving segmentation results.

3 Methods

3.1 Max-Tree

Let \(I:E \rightarrow V\) be a discrete, scalar (i.e. grayscale) image, with \(E\subseteq \mathbb Z^n\) and \(V\subseteq \mathbb Z\). A cut of I at level v is defined as: \(X_v(I) = \{p \in E | I(p) \ge v\}\). Let C[X] be the set of connected components of X. Let \(\varPsi \) be the set of all the connected components of the cuts of I:

$$\varPsi (I)=\bigcup _{v\in V} C[X_v(I)]$$

The relation \(\subseteq \) is a partial order on \(\varPsi \). The transitive reduction of the relation \(\subseteq \) on \(\varPsi \) induces a graph called the Hasse diagram of \((\varPsi ,\subseteq )\). This graph is a tree, the root of which is E. The rooted tree \(\mathcal T=(\varPsi ,L, E)\) is called the max-tree of I, with \(\varPsi ,L,E\) being respectively the set of nodes, the set of edges and the root of \(\mathcal T\). For \(N\ne E\), the parent node of N, denoted Par(N), is the unique node such that \((Par(N),N)\in L\) and \(N \subseteq Par(N)\). The branch associated with a node is the set of its ancestors and is defined for a node \(N\in \varPsi \) by: \(Br(N)=\{X\in \varPsi \mid X\supseteq N\}\).
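To make this structure concrete, the following minimal sketch builds a max-tree and accumulates the area attribute from the leaves to the root. It relies on the max_tree function of scikit-image (listed among our requirements in Sect. 4.3) for illustration only; the implementation actually used in this work is the recursive C++ one based on Salembier [21] described below.

import numpy as np
from skimage.morphology import max_tree

def area_attribute(image):
    # parent[p] is the (ravelled) parent pixel of p; traverser lists pixels
    # so that every pixel appears after its parent.
    parent, traverser = max_tree(image, connectivity=2)
    parent = parent.ravel()
    area = np.ones(image.size, dtype=np.int64)   # every pixel counts for 1
    for p in traverser[::-1]:                    # children before parents
        if parent[p] != p:                       # skip the root E
            area[parent[p]] += area[p]
    return parent, area                          # meaningful at canonical pixels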

In this work, the computation of the max-tree is based on the recursive implementation of Salembier [21], and node attributes are computed during the construction of the tree. In the rest of this paper, we focus on the following attributes, which have been proposed in the literature:

  • The height H is the minimum gray level of the connected component [21].

  • The area A is the number of pixels in the connected component [21].

  • The contour length CT is the number of pixels having at least one neighbor inside and one neighbor outside the component [20].

  • The contrast C is the difference between the maximum and minimum gray level in the connected component [21].

  • The complexity CPL represents the contour length CT divided by the area A [20].

  • The compacity (sometimes called compactness or circularity) CPA is the area A divided by the squared contour length \(CT^2\) [22].

  • The volume V is the sum of the differences between the pixel values in the node and the node height [16].

  • The mean gradient border MGB represents the mean value of the gradient magnitude for the contour pixels of the component [3].

The tree attributes can be merged in order to compute an image by associating, to each pixel, an attribute value computed from its corresponding nodes [3]. Each pixel p belongs to several nodes: the connected component N containing p in the cut \(X_{I(p)}(I)\), and all the nodes of its branch Br(N). To associate a unique value to each pixel, different policies can be applied, for example keeping the maximum, the minimum or the mean value of the attributes of the branch nodes [3].

In this work, we propose the following strategy. For each pixel p, the set of nodes belonging to the branch of p is retrieved, and only the subset of nodes whose attribute value lies in a given range (provided as a parameter) is kept. From this subset, the node \(N_{best}\) optimizing a stability criterion is selected. Finally, the value of p in the resulting image is set to the attribute value of \(N_{best}\). The resulting image is normalized to the range \(V=\llbracket 0, 255 \rrbracket \).

The criterion used to retrieve the optimal node is based on the concept of Maximally Stable Extremal Regions proposed by Matas et al. [10]. The idea is to retrieve the most stable regions in terms of area variation between successive nodes, since such regions correspond to salient objects of the image. For each node \(N\in \varPsi \) with \(N\ne E\) (i.e. different from the root), we define two stability attributes as follows:

$$\begin{aligned} \nabla _A(N) = \frac{\mid A(Par(N)) - A(N)\mid }{|H(Par(N))-H(N)|} \cdot \frac{1}{A(N)} \end{aligned}$$
$$\begin{aligned} \varDelta _A(N) = | \nabla _A(Par(N)) - \nabla _A(N) | \end{aligned}$$

where Par(N) denotes the parent node of N.
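The following sketch illustrates this per-pixel strategy. It assumes that per-node arrays attr (the attribute to map, e.g. the contrast C) and stability (e.g. \(\varDelta _A\), where a lower value is assumed to mean a more stable node) have already been computed on the pixel-parent representation of the previous sketch; it illustrates the principle and is not the C++ implementation used in the experiments.

import numpy as np

def attribute_map(image, parent, attr, stability, lo, hi):
    # For each pixel, walk up its branch Br(N), keep the nodes whose attribute
    # lies in [lo, hi], and retain the most stable one (N_best).
    out = np.zeros(image.size, dtype=np.float64)
    for p in range(image.size):
        best, best_stab = None, np.inf
        n = p
        while True:
            if lo <= attr[n] <= hi and stability[n] < best_stab:
                best, best_stab = n, stability[n]
            if parent[n] == n:          # reached the root E
                break
            n = parent[n]
        if best is not None:
            out[p] = attr[best]
    # Normalize the resulting image to the range [0, 255].
    out = 255.0 * (out - out.min()) / max(np.ptp(out), 1e-12)
    return out.reshape(image.shape)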

3.2 Segmentation

We base our segmentation method on fully convolutional networks for semantic segmentation. We use attribute-filtered images or max-tree attribute maps as additional inputs to our network: the original and the preprocessed images are fed to the network, concatenated along the color (spectral) channels.

For the model architecture, we use a 2D U-Net [19], which is a reference for biomedical image segmentation. A 3D U-Net [2] could also be used, but the results are not necessarily better [17, 23, 25] and the computational cost of training is much higher. In a preliminary experiment comparing 2D and 3D models with an equal number of parameters and the same input size, the 2D U-Net performed as well as, if not better than, the 3D one.

Fig. 1. U-Net architecture and its backbone block. The number in each box corresponds to the number of filters of the convolution blocks at that level. The network is fed with the original and the preprocessed images, concatenated in the color channels.

Each block of the network is composed of convolutions with ReLU activation, followed by batch normalization [6] and a residual connection [4] (see Fig. 1). We use a 50% dropout when entering the deepest block to avoid overfitting. We always use padded convolutions to preserve the spatial dimensions of the output. The model starts with 64 filters at the first level, for a total of 32.4 million parameters.
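As an illustration, a backbone block of Fig. 1 could be written as follows with TensorFlow/Keras. The 3×3 kernel size and the 1×1 projection used to match the number of channels before the addition are our assumptions; only the overall structure (padded convolutions with ReLU, batch normalization [6], residual connection [4]) is taken from the text above.

import tensorflow as tf
from tensorflow.keras import layers

def backbone_block(x, filters):
    # 1x1 projection so that the shortcut has the same number of channels
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(y)
    y = layers.BatchNormalization()(y)
    return layers.Add()([shortcut, y])   # residual connection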

We train our models by minimizing the Dice loss. For the binary segmentation task, we use the classical Dice loss; for the multi-class segmentation task, we use a weighted mean of the per-class losses, with the same weight (\(W=0.5\)) for our two classes. We denote by X the ground truth, Y the prediction, W the weight list and C the class list. The \(\varepsilon \) term ensures numerical stability when \(\sum {(X + Y)} = 0\) and is set to \(10^{-4}\).

$$ L_{Dice}(X, Y) = 1 - \frac{2 \cdot \sum {X \cdot Y} + \varepsilon }{\sum {(X + Y)} + \varepsilon } $$
$$ L_{DiceMean}(X, Y, W) = \frac{1}{|C|}\sum _{c \in C}{W_c \cdot L_{Dice}(X_c, Y_c)} $$
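A direct transcription of these two losses in TensorFlow could look as follows; this is a sketch, the training scripts of the repository [11] being the reference implementation.

import tensorflow as tf

def dice_loss(y_true, y_pred, eps=1e-4):
    # L_Dice = 1 - (2 * sum(X * Y) + eps) / (sum(X + Y) + eps)
    inter = tf.reduce_sum(y_true * y_pred)
    total = tf.reduce_sum(y_true + y_pred)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def dice_mean_loss(y_true, y_pred, weights=(0.5, 0.5)):
    # Weighted mean over the classes (last axis), with W_c = 0.5 for our two classes.
    losses = [w * dice_loss(y_true[..., c], y_pred[..., c])
              for c, w in enumerate(weights)]
    return tf.add_n(losses) / len(weights)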

The model is trained for 128 epochs; each epoch is composed of 512 batches, each batch of 8 patches, and each patch is a \(256 \times 256 \times C\) subpart of the image, with C the number of channels. For data augmentation, we apply random \(90^{\circ }\) rotations and horizontal and vertical flips to the patches. We train our model using the Adam optimizer [8] with the following parameters: \(\alpha =0.001\), \(\beta _1=0.9\), \(\beta _2=0.999\), \(\varepsilon =10^{-7}\).
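The corresponding training configuration can be sketched as follows; the augmentation helper and its name are ours, while the hyper-parameters are those given above.

import numpy as np
import tensorflow as tf

def augment(patch, label):
    # Random 90-degree rotation and horizontal/vertical flips.
    k = np.random.randint(4)
    patch, label = np.rot90(patch, k), np.rot90(label, k)
    if np.random.rand() < 0.5:
        patch, label = np.fliplr(patch), np.fliplr(label)
    if np.random.rand() < 0.5:
        patch, label = np.flipud(patch), np.flipud(label)
    return patch, label

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                     beta_2=0.999, epsilon=1e-7)
# 128 epochs x 512 batches x 8 patches of size 256 x 256 x C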

3.3 Evaluation Metrics

To evaluate our models, we binarize the prediction with a threshold of 0.5. To predict a slice, we feed the whole slice to the network, avoiding the negative border effects of padding.

To evaluate our results, we use the F1-Score, which is a region-based metric, and the average symmetric surface distance (ASSD), which is a boundary-based metric. Note that the F1-Score is equivalent to the Dice score. We denote by TP, FP and FN the number of true positives, false positives and false negatives, respectively. X is the ground truth, Y the binary prediction and \(\partial X\) the boundary of X.

$$ \text {F1-Score} = \frac{2 \times TP}{2 \times TP + FP + FN} $$
$$ \text {ASSD}(X, Y) = \frac{\sum _{x \in \partial X}{d(x, Y)} + \sum _{y \in \partial Y}{d(y, X)}}{|\partial X| + |\partial Y|} $$

with \(d(x, A) = \min _{y \in A}{{\Vert x - y \Vert }_2}\).
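In practice, both metrics can be computed with MedPy, which is listed among the requirements in Sect. 4.3; the sketch below also applies the 0.5 binarization threshold mentioned above.

from medpy.metric.binary import assd, dc

def evaluate_slice(probability, ground_truth, threshold=0.5):
    prediction = probability >= threshold     # binarize the prediction
    f1 = dc(prediction, ground_truth)         # F1-Score, i.e. the Dice score
    surface = assd(prediction, ground_truth)  # average symmetric surface distance
    return f1, surface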

4 Experiments

In this section, we test whether the segmentation is improved by adding filtered images to the input. We compare the original-image input with the enriched versions. At the same time, we compare a multi-class segmentation model with two binary segmentation models. Finally, we repeat each configuration 11 times. In total, 363 models have been trained for this experiment; each training lasts about 5 h on an NVIDIA GeForce RTX 2080 Ti.

First, we define the filters used as our experiment variables. The attributes were selected for their potential usefulness for segmentation; the selected filters are listed in Table 1 (attribute map strategy) and Table 2 (connected operator strategy), and renamed for the sake of simplicity.

Table 1. List of used attribute map filters and their names.
Table 2. List of used connected filters and their names. The “inverse” column is checked when the filter is applied to the inverted image.

Before processing the image with a max-tree, we apply a low-pass filter (\(9 \times 9\) mean filter).
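This preprocessing step, implemented in 01_mean_filter.py (Sect. 4.3), amounts to the following; the file names below are only illustrative.

from scipy.ndimage import uniform_filter
from skimage import io

slice_2d = io.imread("slice_0000.tif")        # hypothetical slice file name
smoothed = uniform_filter(slice_2d, size=9)   # 9 x 9 mean filter
io.imsave("slice_0000_mean9.tif", smoothed)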

4.1 Data

We perform our experiments on a stack of 80 slices from a 3D FIB-SEM image. Each slice has a size of \(1536 \times 1408\) pixels. The image represents a HeLa cell and has an \((x\times y\times z)\) resolution of 5 nm \(\times \) 5 nm \(\times \) 20 nm. A ground truth is available on the stack for two kinds of organelles (i.e. cell subunits): mitochondria and endoplasmic reticulum. A default background class is assigned to unlabeled pixels. An example slice with labels is shown in Figs. 3 and 4 in Sect. A.2. Figures 5 to 14 depict the slice for each applied filter.

We divide the stack into 3 sets: training (first 40 slices), validation (next 20 slices) and test (last 20 slices). The training set is used to train the network, the validation set to select the best model during training, and the test set to provide the evaluation metrics.
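Assuming the annotated stack is stored as a single multi-page TIFF (the file name below is hypothetical), the split simply amounts to slicing along the first axis:

from skimage import io

stack = io.imread("hela_stack.tif")   # shape (80, 1536, 1408)
train, validation, test = stack[:40], stack[40:60], stack[60:]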

4.2 Results

Figures 2a and 2b depict the F1-scores of the two segmentation classes as box plots. Detailed mean and variation scores are available in Tables 3 and 4 in Sect. A.1.

Fig. 2. Box plots of the F1-scores. The box shows the quartiles of the results, while the whiskers extend to show the rest of the distribution.

Baseline Segmentation Results. Mitochondria are well segmented, with a median F1-score reaching 95% in multi-class segmentation and 94% in binary segmentation, which leaves little room for improvement. Median F1-scores for the reticulum reach 72% and 71% in multi-class and binary segmentation respectively. This thin organelle is indeed more difficult to segment.

Additional-Input Segmentation Results. On the mitochondria, the additional inputs improve the results in 11 cases out of 20, and on the reticulum in 17 cases out of 20. The gain on the reticulum is particularly interesting since it is the class the baseline setup struggles most to segment. The following additional inputs improve the result in all four tests (binary and multi-class segmentation of mitochondria and reticulum): Contrast\(_{\varDelta _A}\), Complexity\(_{\varDelta _A}\) and Contrast \(\beta \). Moreover, the whiskers show that only Contrast\(_{\varDelta _A}\) and Contrast \(\beta \) are stable in this experiment. These two inputs are therefore good candidates as additional inputs to improve segmentation results. On the contrary, Compacity\(_{\varDelta _A}\) and MGB do not yield any improvement.

4.3 Reproducibility

In this section, we present the steps required to reproduce the results presented above.

We follow the ACM definition of reproducibility: “The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author’s own artifacts.” [1]

For this purpose, our code and datasets are publicly available. The project is split into three subprojects: first, the detailed experiment documentation, training scripts, evaluation scripts, preprocessing scripts and result logs [11]; second, the max-tree related functions and preprocessing script [14]; finally, the dataset with images and annotations [13].

The following information is also available, with more details, in the documentation repository.

Requirements. A system running Ubuntu 18.04.6 (or compatible) with g++ and git installed. Python 3.6.9 with an environment including TensorFlow 2.6.2, NumPy, SciPy, scikit-image and MedPy.

Image Preprocessing.

  • Prepare the data for extraction with a low-pass filter. python 01_mean_filter.py

  • Extract the attribute images from the pre-processed images. ./build_bin_requirements.sh ./02_attribute_image.sh

  • Crop the image to the annotated area and construct a tiff stack. python 03_crop_roi.py

Network Training and Evaluation. For the following commands, $ID is a unique identifier for the training run, $INPUT is the folder containing the dataset, $OUTPUT is the folder receiving the trained models and evaluation metrics, $DATASET selects the dataset to use (in our case, the binary or multi-class dataset), and $SETUP selects the experiment to run. An automation bash script is available in the repository to run each of the 33 setups once.

  • Train the networks. python train.py $ID $INPUT $OUTPUT $DATASET $SETUP

  • Evaluate the networks. python eval.py $ID $INPUT $OUTPUT $DATASET $SETUP BEST

Result Analysis and Figures Reproduction. Since the output of each model evaluation is a comma-separated values (CSV) file, the results can be analyzed using various tools. We propose to use a Jupyter notebook with Pandas and Seaborn, merging the CSV files into a single dataframe. An example notebook, analysis.ipynb, is provided on the GitHub repository; it is the one we use to produce our result figures and tables.
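A minimal version of this analysis could look as follows; the folder layout and column names are assumptions, the reference being the analysis.ipynb notebook of the repository.

import glob
import pandas as pd
import seaborn as sns

# Merge all per-model evaluation CSV files into a single dataframe.
frames = [pd.read_csv(path) for path in glob.glob("output/**/*.csv", recursive=True)]
results = pd.concat(frames, ignore_index=True)

# One box per setup, as in Fig. 2 (column names are hypothetical).
sns.boxplot(data=results, x="setup", y="f1_score")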

5 Conclusion

In this paper, we have presented a detailed experimental setup to evaluate the use of additional inputs in a CNN-based segmentation task. The additional inputs are attribute maps obtained from a max-tree representation of the image. The evaluation is performed on segmentation tasks in the context of 3D electron microscopy. While most of the additional inputs improve at least one segmentation task, two of them – namely Contrast\(_{\varDelta _A}\) and Contrast \(\beta \) – improve all the tested segmentation tasks in terms of median F1-score and stability. Beyond the segmentation results, the setup inspired from [3] has been entirely implemented in C++ and Python and is provided in open access to make it reproducible.

As a perspective of this work, the feature extraction method based on max-tree attributes presented in this paper could be used for other applications. For example, it would be interesting to compute the max-tree directly inside the model and to use the attribute images as a nonlinear filter. The attribute maps could also be used as feature maps for simpler and more explainable classifiers such as random forests or even decision trees. Besides, the \(\varDelta _A\) attributes we defined could be used in an interactive segmentation setup, where a single pixel would allow the selection of an interesting node, singling out an object connected component over the background. Finally, we proposed here a max-tree based method, but an extension to the tree of shapes [15] could be interesting and would add more information to the image.