1 INTRODUCTION

Considering the application of artificial intelligence methods, and especially deep neural networks (DNNs), to reconstructing brain electron microscopy (EM) data over the last ten years, we begin with the 2010 publication [1]. This paper effectively announced the use of serial block-face scanning electron microscopy as a source of high-resolution three-dimensional nanohistology for cells and tissues. A subsequent series of works aimed at creating datasets for training deep learning networks, as well as DNN methods and models for EM data segmentation, designed for binary segmentation of brain cell organelles—neural membranes [2]—and supervoxel segmentation of mitochondria [3]. At the same time, the problems of 3D reconstruction of the brain neural network and of brain connectomics on the basis of neuron organelles and interneuron connections (synapses) were stated [4]. In these problems, the segmentation of such organelles as postsynaptic densities (PSD), vesicles, and axons is of particular importance.

In [5], a team of 24 authors involved in organizing the first international competition on 2D segmentation of brain EM images states that, already at the 2014 connectomics conference organized by the Howard Hughes Medical Institute and the Max Planck Society, it became clear that convolutional networks had become the dominant approach to detecting cell boundaries in serial EM images. The authors also suggest focusing on 3D processing of EM images and joint efforts in connectomics; however, they note that even the best modern algorithms for 3D reconstruction still require significant manual correction effort, which is feasible only through crowdsourcing. This opinion is supported by the earlier paper [6] by 21 authors from leading US universities, which reports the creation, by a joint effort, of a saturated 3D reconstruction of a small (0.13 mm³) portion of mouse neocortex imaged by EM and a database of 1700 synapses in this portion.

The invention of U-Net in 2015 [7] opened a series of novel models and adaptations for segmenting brain EM data. The source of U-Net's success lies in involving the contextual information of the input image at all levels of processing. Almost immediately, the publication [8] experimentally confirmed that the skip connections of the U-Net architecture are effective for solving segmentation problems in biomedicine. U-Net also provided a basis for creating models with parallel inputs, which make it possible to exploit correlations between inputs and, in particular, between EM layers in 3D space [9, 10]. Next, attempts were made to use the capabilities of 3D convolutions for a multifold increase of the amount of context in U-Net and U-Net-like networks—3D U-Net [11] (2016), V-Net [12] (2016), DeepMedic [13] (2017), and HighRes3DNet [14] (2017). This also had a considerable effect, since the amount of context data in the 3D neighborhood of radius one of a voxel increases by a factor of three, and in the neighborhood of radius two it increases by a factor of five.

An interesting direction in the development of semantic segmentation with fully convolutional networks is described in [15, 16]. The latter paper is especially promising. For reconstructing the 3D interconnections of a system of neurons, a novel deep contextual network with a threefold reduction in resolution is proposed, which analyzes multiscale contextual information in a hierarchical structure of resolutions. The network architecture includes auxiliary classifiers that analyze the semantic content of the image hierarchy while restricting themselves to low-level contextual features. The ISBI 2012 dataset is used for the segmentation problem. This method aims to minimize human involvement and demonstrates a shift toward explainable artificial intelligence (XAI).

Table 1. Open labeled electron microscopy datasets

The advantages of 3D data analysis are undeniable; however, the use of 3D convolutional neural networks (CNNs) with a 3D convolution kernel significantly increases the number of trainable parameters, the computational cost, and the memory consumption, which is especially critical for GPU applications. For this reason, architectures using 3D convolutions are gradually being replaced by architectures that decrease the number of trainable parameters and the amount of memory and increase the training speed, while preserving the quality of training and balancing between networks with 3D and 2D convolutions. In this process, various preprocessing methods are usually used, which often yield an improvement of 5% or more [17–19]. For example, in [20] contrast is enhanced using adaptive gamma correction with weighting distribution (AGCWD) [21]. Another trend is the factorization of convolutional kernels into low-rank components [22–25].
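To illustrate the kernel-factorization idea, the following minimal Keras sketch (ours, not taken from the cited papers; the helper name factorized_conv3d and the layer widths are assumptions) replaces a k × k × k convolution with a k × k × 1 in-plane convolution followed by a 1 × 1 × k convolution across EM layers, reducing the per-kernel parameter count roughly from k³ to k² + k.

```python
# A minimal sketch of low-rank kernel factorization ("(2+1)D"-style);
# not the method of [22-25], only an illustration of the general idea.
import tensorflow as tf
from tensorflow.keras import layers

def factorized_conv3d(x, filters, k=3):
    """Approximate Conv3D(filters, (k, k, k)) by two cheaper convolutions."""
    # In-plane 2D context within each EM layer.
    x = layers.Conv3D(filters, (k, k, 1), padding="same", activation="relu")(x)
    # 1D context across neighboring EM layers.
    x = layers.Conv3D(filters, (1, 1, k), padding="same", activation="relu")(x)
    return x

# Toy usage: a stack of eight 256 x 256 grayscale EM layers.
inp = layers.Input((256, 256, 8, 1))
model = tf.keras.Model(inp, factorized_conv3d(inp, filters=32))
```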

The paper [26] of 2019 reports the creation of a UNI-EM system with an interface convenient for subject matter experts. After labeling a small number of training samples, the system uses 2D and 3D deep learning networks and produces a segmentation of brain EM images for correcting the labeling and training parameters. UNI-EM comes with a set of 2D DNNs—U-Net, ResNet, HighwayNet, and DenseNet.

The paper [27] of 2019 determines the best version of U-Net, using as an example the detection of vesicles in transmission electron microscopy (TEM) data with a resolution of 1.56 nm (two to three times better than usual), by comparing the U-Net and Fully Residual U-Net (FRU-Net) architectures. It is found that the latter improves accuracy by 4–5%. In the case of binary classification on three different TEM datasets, the error for FRU-Net did not exceed 10%, whereas for U-Net the errors were 17%, 27%, and 17%.

The paper [28] of 2021 investigates the capabilities of the Fully Residual U-Net (FRU-Net) with four levels of resolution reduction (the original resolution of 640 × 640 is halved four times, down to 40 × 40), using binary 2D segmentation of cell membranes as an example. An augmentation scheme that enlarged the dataset eightfold through rotations and reflections was created. On the Drosophila EM dataset (ISBI 2012 EM segmentation challenge leaderboard, June 2020), a membrane segmentation accuracy of about 98–99% was achieved. The publication [29] of 2021 proposes a more complex network structure, called the hierarchical view-ensemble convolutional (HVEC) network, as an alternative to a simple 3D structure. This structure inherits the abovementioned idea of [16] with three levels of resolution reduction and additional outputs for each level; the resolution-reducing architecture is then complemented with a resolution-increasing branch, as is typical for U-Net.

The application of artificial intelligence methods to EM data processing is largely hampered by the small amount of labeled data for training and testing DNNs. Open EM data as a whole are represented by only a few labeled datasets, both due to the laboriousness of preparing samples for an electron microscope and due to the lack of specialists for manual labeling. We found four open EM datasets, the earliest and most popular of which are labeled for only one class (mitochondria or membranes). In the other two datasets, several classes are distinguished. As a result, the majority of neural networks used in EM processing are trained only to perform binary segmentation.

In connection with the above, the main aims of this work are (1) to create a dataset with manual multiclass labeling for a list of classes that covers the main modern tasks of EM data segmentation; (2) to develop algorithms for automatically generating a dataset of synthetic objects of the specified main classes and to create such a dataset, primarily of objects that are scarcely represented in the traditional datasets; and (3) to study the multiclass segmentation capabilities of U-Net-like architectures, starting with U-Net (in this work), using datasets with manual labeling and additional synthetic labeling.

2 DATA AND METHODS

In this section, we describe publicly available datasets. The most popular datasets for assessing the segmentation of mitochondria were collected by Lucchi et al. in [3].

It is seen from Table 1 that in three of the four labeled open datasets only one class is labeled; a single dataset contains more than one labeled class. For this reason, the vast majority of neural networks in EM are trained to classify only two classes (object and background).

We used the EPFL dataset, also known as the Lucchi mitochondria segmentation dataset, available at https://www.epfl.ch/labs/cvlab/data/data-em/. Initially, these data contain masks only for mitochondria. For this reason, to assess multiclass segmentation algorithms, we manually labeled 20 layers of the training sample (1024 × 768) and three layers of the test sample for the following classes: (1) mitochondria, including their boundaries; (2) boundaries of mitochondria; (3) cell membranes; (4) postsynaptic densities (PSD); (5) axon sheaths; and (6) vesicles.

Accurate manual labeling of one layer takes 5–8 hours. Our labeling of the dataset EPFL is available at https://github.com/GraphLabEMproj/unet. We plan to continue the work on labeling and do this for both datasets. An example of labeling a layer fragment is shown in Fig. 1.

Fig. 1. Example of manual image labeling carried out by the authors: (a) original image, (b) membranes, (c) mitochondrion with its boundaries, (d) boundaries of mitochondria, (e) PSD, (f) vesicles, (g) axon sheath.

As it happens, the axon sheath in the training dataset is present only in the first 36 layers and looks completely different from the axon sheath in the test dataset (Fig. 2). In the test dataset, the axon is present in the first 70 layers, changes its shape from elongated to more rounded, and also has a darker interior and an inner ring.

Fig. 2. Axon sheaths in the training and test EPFL datasets: (a) axon sheath in the training set; (b) axon sheath in the test set, first layer; (c) axon sheath in the training set, 35th layer; (d) axon sheath in the test set, 70th layer.

For the synthesized dataset, we generated 100 images of size 256 × 256 pixels containing the least represented classes—postsynaptic densities and axon sheaths. An example of the data is shown in Fig. 3. The program for data generation is written in C#. The shape, size, and gray levels of the compartments are chosen to be similar to those of the test EPFL dataset. To make the generated images more similar to real-life images, they were blurred with a Gaussian filter with a kernel of radius seven, and Gaussian noise with a level of 20 was added. The advantage of a synthetic set is that any number of images, together with their labeling, can be obtained automatically.

Fig. 3. Example of synthesized data (only nonzero masks are shown): (a) layer, (b) mask of axon sheaths, (c) mask of membranes, (d) PSD mask.
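The generator itself is written in C#; the following Python sketch (ours; the shapes, gray levels, and names such as make_sample are illustrative assumptions) reproduces only the pipeline described above: draw a ring-shaped compartment together with its mask, blur with a Gaussian filter of radius seven, and add Gaussian noise of level 20.

```python
# A minimal Python re-sketch of the synthetic-data idea; the authors'
# generator is in C#, so all parameters here are illustrative.
import numpy as np
from PIL import Image, ImageDraw, ImageFilter

def make_sample(size=256, seed=0):
    rng = np.random.default_rng(seed)
    img = Image.new("L", (size, size), color=150)        # background gray level
    mask = Image.new("L", (size, size), color=0)         # axon-sheath mask
    draw, mdraw = ImageDraw.Draw(img), ImageDraw.Draw(mask)
    cx, cy = rng.integers(64, size - 64, size=2)         # random center
    rx, ry = rng.integers(20, 40, size=2)                # random radii
    box = (cx - rx, cy - ry, cx + rx, cy + ry)
    draw.ellipse(box, outline=40, width=6)               # dark ring = sheath
    mdraw.ellipse(box, outline=255, width=6)             # matching label
    img = img.filter(ImageFilter.GaussianBlur(radius=7)) # smooth transitions
    noisy = np.asarray(img, dtype=np.float32) + rng.normal(0, 20, (size, size))
    return np.clip(noisy, 0, 255).astype(np.uint8), np.asarray(mask)

layer, axon_mask = make_sample()
```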

2.1 Network Architecture

U-Net is considered a standard convolutional network architecture for image segmentation tasks. This architecture consists of a contracting path that captures the global context and a symmetric expanding path that enables accurate localization. Our implementation is based on the U-Net project at https://github.com/zhixuhao/unet. In the original project, U-Net was used for binary classification of membranes; in this work, we use U-Net for multiclass segmentation. We forked the original repository and made modifications in it, which are available at https://github.com/GraphLabEMproj/unet together with our labeling of the Lucchi data.

Following the author of the code at https://github.com/zhixuhao/unet, the implementation of U-Net has some differences from the classical U-Net network [7]:

• The network input is an image of size 256 × 256 × 1.

• The network output is 256 × 256 × \(num\_classes\), where \(num\_classes\) is the number of classes.

• The sigmoid activation function guarantees that the mask is in the range [0, 1].

In addition, we added batch normalization after each convolution and ReLU activation layer.
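A minimal Keras sketch of these modifications (ours, not copied from the repository; the block structure and names are assumptions) is given below: each convolution is followed by ReLU activation and batch normalization, and the head outputs one sigmoid mask per class.

```python
# A sketch of the modified U-Net building blocks; the contracting and
# expanding paths are elided, only the described modifications are shown.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)  # added after each conv + ReLU
    return x

num_classes = 6
inp = layers.Input((256, 256, 1))                 # input: 256 x 256 x 1
x = conv_block(inp, 64)
# ... the rest of the U-Net contracting/expanding paths goes here ...
out = layers.Conv2D(num_classes, 1, activation="sigmoid")(x)  # masks in [0, 1]
model = tf.keras.Model(inp, out)
```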

3 EXPERIMENTAL RESULTS

3.1 Assessment Criteria

We use the Dice–Sørensen coefficient (DSC) and the Jaccard coefficient (JAC), which are commonly used in biomedical image segmentation. Let TP denote the number of pixels correctly classified as belonging to the target class (true positives), TN the number of correctly classified background pixels (true negatives), FP the number of background pixels erroneously classified as belonging to the target class (false positives), and FN the number of target-class pixels erroneously classified as background (false negatives). Then, the metrics are defined as follows:

$$DSC = \frac{{2TP}}{{2TP + FP + FN}},$$
$$JAC = \frac{{TP}}{{TP + FP + FN}}.$$

The values of DSC and JAC vary from zero to one. In contrast to the Jaccard coefficient, the dissimilarity function corresponding to DSC (i.e., \(1 - DSC\)) is not a proper distance metric, since it does not satisfy the triangle inequality. JAC and DSC are equivalent in the sense that each can be expressed in terms of the other:

$$DSC = \frac{{2JAC}}{{1 + JAC}}.$$
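This identity follows directly from the definitions: substituting the expressions for JAC and DSC in terms of TP, FP, and FN gives

$$\frac{{2JAC}}{{1 + JAC}} = \frac{{\frac{{2TP}}{{TP + FP + FN}}}}{{\frac{{2TP + FP + FN}}{{TP + FP + FN}}}} = \frac{{2TP}}{{2TP + FP + FN}} = DSC.$$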

Since we consider multiclass segmentation in this work, we are interested in multiclass metrics. Since the Jaccard (or Dice) metric compares two sets, in the case of multiclass classification the result is a vector of Jaccard (or Dice) metrics, one per class. For training a neural network, a scalar error function is needed; therefore, for multiclass segmentation, the metric vector must be reduced to a scalar. To this end, we use the linear convolution

$${{W}_{{scalar}}} = \sum\limits_{i = 1}^N {{{\lambda }_{i}}} {{W}_{i}},\quad {{\lambda }_{i}} \geqslant 0,\quad \sum\limits_{i = 1}^N {{{\lambda }_{i}}} = 1,$$

where \({{\lambda }_{i}}\) is a weighting coefficient and \({{W}_{i}}\) is the value of the distance coefficient for the ith class. \({{W}_{{scalar}}}\) is a scalar value or convolution of a distance vector, and N is the number of classes.

In this work, we use the linear convolution of DSC with the weighting coefficients \({{\lambda }_{i}}\) equal to 1/N.
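As a concrete illustration, the following NumPy sketch (ours, not the authors' code; function names are assumptions) computes the per-class Dice vector and its linear convolution with equal weights \({{\lambda }_{i}} = 1{\text{/}}N\).

```python
# A minimal NumPy sketch of the per-class Dice vector and its linear
# convolution into a scalar with equal weights 1/N.
import numpy as np

def dice_per_class(pred, target, num_classes, eps=1e-7):
    """pred, target: integer label maps of equal shape; returns a length-N vector."""
    scores = np.empty(num_classes)
    for c in range(num_classes):
        p, t = pred == c, target == c
        tp = np.logical_and(p, t).sum()
        scores[c] = 2 * tp / (p.sum() + t.sum() + eps)  # 2TP / (2TP + FP + FN)
    return scores

def linear_convolution(w, lam=None):
    lam = np.full_like(w, 1 / len(w)) if lam is None else np.asarray(lam)
    return float(np.dot(lam, w))  # sum_i lambda_i * W_i, with sum lambda_i = 1

# Usage on a toy 4 x 4 label map with N = 2 classes:
pred = np.array([[0, 0, 1, 1]] * 4)
target = np.array([[0, 1, 1, 1]] * 4)
print(linear_convolution(dice_per_class(pred, target, 2)))
```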

3.2 Experiments

To obtain a new training sample, the twenty high-resolution images of the original training sample were cut into 256 × 256 fragments with an overlap of a quarter of the fragment size; see the sketch after this paragraph. In total, 860 fragments were obtained. To further enlarge the training sample, we applied random rotations, random shifts, and random scale changes within a small range (5%).
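A minimal sketch of the tiling step is given below (ours; the exact fragment positions and edge handling used by the authors may differ, which is why the fragment counts need not match). An overlap of a quarter of the fragment size corresponds to a stride of 192 pixels for 256 × 256 fragments.

```python
# A sketch of cutting a 1024 x 768 layer into 256 x 256 fragments with an
# overlap of a quarter of the fragment size (stride 192); the final shifted
# fragment that covers each edge is our assumption.
import numpy as np

def tile(layer, frag=256):
    stride = frag - frag // 4                        # 192 for frag = 256
    h, w = layer.shape
    ys = list(range(0, h - frag + 1, stride))
    xs = list(range(0, w - frag + 1, stride))
    if ys[-1] != h - frag: ys.append(h - frag)       # cover the bottom edge
    if xs[-1] != w - frag: xs.append(w - frag)       # cover the right edge
    return np.stack([layer[y:y + frag, x:x + frag] for y in ys for x in xs])

fragments = tile(np.zeros((768, 1024), dtype=np.uint8))
```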

To obtain a mixed training sample, we added to the original 860 fragments 100 synthesized fragments; thus, in total we have 960 fragments.

We set aside 20% of the images of the training sample as a validation sample and used a batch size of seven. The model was tested on three layers (129 fragments). We used the Adam optimizer with a learning rate of \(2 \times {{10}^{{ - 5}}}\). The training curves for different experiments are presented in Figs. 4 and 5.
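The training configuration can be sketched in Keras as follows (ours; the loss function and the one-layer stand-in model are assumptions made only so that the snippet runs, while in practice the model is the modified U-Net of Section 2.1).

```python
# A sketch of the training setup: Adam with learning rate 2e-5, 20% of the
# training sample used for validation, batch size 7. The stand-in model and
# random arrays exist only to make the snippet self-contained.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input((256, 256, 1)),
    layers.Conv2D(6, 1, activation="sigmoid"),       # stand-in for the U-Net
])
x = np.random.rand(35, 256, 256, 1).astype("float32")
y = (np.random.rand(35, 256, 256, 6) > 0.5).astype("float32")

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="binary_crossentropy")            # loss choice is ours
model.fit(x, y, validation_split=0.2, batch_size=7, epochs=1)
```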

Fig. 4. Learning curves for the original dataset; from left to right: learning curve for six-class segmentation, learning curve for five-class segmentation, and learning curve for binary segmentation of mitochondria.

Experiment 1. Five segmentation classes—mitochondrion with its boundary, membranes, PSD, axon sheaths, and vesicles. The number of epochs is 1000.

Experiment 2. Six segmentation classes—mitochondrion with its boundary, boundary of the mitochondrion, membranes, PSD, axon sheaths, and vesicles; that is, one more class, the boundary of the mitochondrion, is added. The number of epochs is 1000.

Experiment 3. One segmentation class—mitochondrion with its boundary. The number of epochs is 200.

It is seen from Table 2 that the quality of multiclass segmentation is only slightly inferior to binary segmentation.

Table 2. Results of electron microscopy data segmentation for the original dataset (ORG) and for the dataset enriched with synthesized images (SYN); Dice coefficient is used as the metric

The class of mitochondria boundaries is a subclass of the class of mitochondria with their boundaries, and the additional edge enhancement improves the segmentation results for the unifying class. The network was trained on unbalanced classes, since the sizes of the compartments and their frequencies of occurrence differ by factors of tens.

4 DISCUSSION

In this section, we discuss Table 3, “Comparison of mitochondria segmentation methods,” in which we collected the most representative results on mitochondria segmentation obtained with binary and multiclass models.

Table 3. Comparison of mitochondria segmentation methods

We tested our models on the entire EPFL dataset and used these values instead of the results presented in Table 2. We cannot directly compare the results in Table 3, since our models were trained on a significantly reduced version of EPFL. However, we can put forward several hypotheses that need to be tested. The worst results were obtained in layers containing an axon, fuzzy membranes, incomplete mitochondria, and mitochondria with darker borders and darker inclusions than in the labeled layers. We assume that labeling more layers or generating synthetic data with the proper characteristics will improve the results.

5 CONCLUSIONS

We manually carried out the multiclass labeling of 20 layers of the training set and three layers of the test set for the well-known dataset EPFL, which includes the following classes: (1) mitochondria, including their boundaries; (2) boundaries of mitochondria; (3) membranes; (4) postsynaptic densities (PSD); (5) axon sheaths; and (6) vesicles. Software for generating synthetic labeled datasets with the same classes was developed. A synthetic labeled dataset that includes axons, PSD, and membranes was created.

Results of multiclass segmentation of brain electron microscopy data obtained using a modified U-Net, with data layers decomposed into 256 × 256 fragments while preserving the original resolution, are presented.

The study showed that the results of binary, five-class, and six-class segmentation are similar in quality: 0.911, 0.910, and 0.908, respectively. The quality of segmentation is affected by the presence of a sufficient number of specific features distinguishing the selected classes and by the representation of these features in the training sample.

The expansion of datasets with synthesized images improves the classification results. Expanding the manually labeled dataset (860 images of 256 × 256) with a synthesized dataset (100 images of 256 × 256) containing the less represented classes (axons, PSD, and membranes) significantly improved the accuracy of the six-class model (see Table 2)—from 0.228 to 0.790, from 0.553 to 0.745, and from 0.743 to 0.750, respectively—roughly in proportion to the deficit that was eliminated.