
1 Introduction

In computer vision, the goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images [1]. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain common characteristics [2]. Many applications require image segmentation, such as content-based image retrieval, machine vision, medical imaging [3], object detection [4], recognition tasks [5], control systems, and video surveillance.

Computer-aided image analysis systems can enhance the diagnostic capabilities of physicians and reduce the time required for accurate diagnosis [6]. As one of the major techniques, medical image segmentation plays a significant role in clinical diagnosis. It is considered challenging because medical images often have low contrast, various types of noise, and missing or diffuse boundaries [7]. Research efforts have been devoted to processing and analyzing medical images to segment meaningful information such as the volume, shape, and motion of organs, to detect abnormalities, and to quantify changes in follow-up studies [8]. Many image segmentation techniques are available in the literature. Some use only the gray-level histogram [1] or spatial details, while others use fuzzy set-theoretic approaches. Most of these techniques are sensitive to noise and thus not suitable for medical imaging. The Markov Random Field model is robust to noise but computationally expensive [9]. Manual segmentation is an expensive, time-consuming task. It is subject to inter-observer variation and subjective judgment, which increases the possibility that different observers will reach different conclusions about the presence or absence of tumors; even the same observer will occasionally reach different conclusions on different occasions [10]. An efficient and consistent medical image segmentation algorithm would help avoid these inconsistencies.

Deep learning algorithms have shown remarkable results in various image processing fields for most benchmark image datasets, including MNIST (classifying handwritten digits) [11], CIFAR-10 (classifying \(32\times 32\) color images into 10 categories) [12], CIFAR-100 (classifying \(32\times 32\) color images into 100 categories) [13], STL-10 (similar to CIFAR-10 but with \(96\times 96\) images) [14], and SVHN (the street view house numbers dataset) [15]. Convolutional Neural Networks (CNNs), as a milestone model of deep learning, are driving advances in image analysis. CNNs not only improve the performance of whole-image classification, but also make progress on extracting features. CNNs make a prediction for every pixel and are able to take advantage of the detailed features of an object image. Krizhevsky et al. made a significant improvement in image classification accuracy on the ImageNet large-scale visual recognition challenge (2012) [16]. Different from traditional image processing methods (e.g., SIFT [17], HOG [18]), which rely on hand-crafted feature descriptors, CNNs are deep architectures that learn features. All features are learned hierarchically from pixels to classifier, and each layer extracts features from the output of the previous layer [19]. However, to obtain superior performance, CNNs usually require large-scale training. Collecting an abundance of medical images is costly and often infeasible, and producing manually annotated training datasets consumes considerable time and resources.

In this paper, we propose a new approach to using CNNs for brain image segmentation, with implicit features that link medical imaging to deep learning. We divide training images into regions and label them automatically to boost the size of the training dataset. A CNN learning framework is designed to capture the local structure of the regions of interest (ROIs) and automatically learn the most relevant features.

After a brief introduction to the background, the problem formulation along with the data generation is provided in Sect. 2. In Sect. 3, we present the details of a CNN architecture. Section 4 shows the results and includes discussion. Finally, the paper is summarized and concluded with future research directions in Sect. 5.

2 Region-Based Segmentation

Image segmentation is a process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics. Therefore, assigning pixel labels using CNNs based on features obtained from the image data is a reasonable strategy for segmentation. The main highlight of a deep learning algorithm is that all features are learned from the image data directly. The neural network architecture has more than 60 million parameters, which makes training on GPUs a necessity. A straightforward way to improve the performance of CNNs is to increase the size of the training data. Acquiring such data is not always feasible for medical imaging. In order to take advantage of CNNs to obtain accurate segmentation, we propose a method that can solve the limited-training-data problem from which CNNs generally suffer. Figure 1 presents an overview of our method. After boosting the size of the training data, we perform stochastic gradient descent (SGD) training of the CNN parameters using this large dataset. The result is a customizable segmentation operation whose performance and behavior reflect the segmentation criteria learned directly from the training data. The proposed method is composed of three main steps:

  1. Generate enough training data from the limited original data.

  2. Label the data efficiently.

  3. Augment the dataset.

Fig. 1. Segmentation system overview. (1) Brain MRI input image, (2) region extraction, (3) feature computation for each region using a convolutional neural network, and (4) region classification to detect ROI pixels.

2.1 Generate Training Data

One of the most challenging aspects of dealing with Magnetic Resonance Imaging (MRI) images is partitioning specific cells and tissues from the rest of the image. An MRI image from our dataset is shown in Fig. 2(a). Experienced doctors manually segment the tumor area (the white area in the lower half) to obtain the binary ground-truth segmentation shown in Fig. 2(b) (zoomed in for clarity). In Fig. 2(b), pixels in the tumor area are set to black and pixels in the background are set to white. In order to provide enough annotated training images for the CNN, a sliding window of \(n\times m\) pixels is applied to extract small region proposals. These patches from one image sample are used for training.
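As a concrete illustration, the following Python sketch shows one way to implement the sliding-window extraction. The \(21\times 21\) window follows Sect. 3; the unit stride and the NumPy patch representation are assumptions not specified in the paper.

```python
import numpy as np

def extract_patches(slice_2d, patch_h=21, patch_w=21, stride=1):
    """Slide an n x m window over a 2-D grayscale MRI slice and return
    every patch together with the coordinates of its central pixel
    (the centers are reused later to rebuild the binary segmentation)."""
    patches, centers = [], []
    for top in range(0, slice_2d.shape[0] - patch_h + 1, stride):
        for left in range(0, slice_2d.shape[1] - patch_w + 1, stride):
            patches.append(slice_2d[top:top + patch_h, left:left + patch_w])
            centers.append((top + patch_h // 2, left + patch_w // 2))
    return np.stack(patches), centers
```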

Fig. 2. An overview of data generation. (a) The original brain MRI image, (b) zoomed-in segmentation labeled by doctors, (c) a positive sample that tightly encloses ROI pixels, (d) a negative sample consisting of background pixels, (e) a positive sample in which the ground-truth segmentation line (the boundary) falls within the central area, (f) a negative sample in which the boundary does not pass through the central box.

2.2 Label Training Data

To label the patches obtained with the sliding window, image regions that tightly enclose ROI pixels are regarded as positive examples (e.g., window c in Fig. 2(b)), while non-ROI regions, which have nothing to do with the tumor area, are treated as negative examples (window d). Regions that partially overlap the ROI are handled with a central-area overlapping rule: a region is also regarded as positive if the ground-truth segmentation line passes through its central area; otherwise, it is considered negative. Figures 2(c–f) show zoomed-in views of the four boxes in Fig. 2(b).
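A minimal sketch of this labeling rule is given below; it assumes the ground-truth convention of Fig. 2(b) (0 = tumor, 1 = background) and uses the \(3\times 3\) central window mentioned in Sect. 4.1, which is otherwise not specified here.

```python
import numpy as np

def label_patch(gt_patch, center_size=3):
    """Return +1 (positive / ROI) or -1 (negative / non-ROI) for one
    ground-truth patch, following the central-area overlapping rule."""
    if np.all(gt_patch == 0):        # patch lies entirely inside the tumor
        return 1
    if np.all(gt_patch == 1):        # patch contains only background pixels
        return -1
    h, w = gt_patch.shape
    top, left = (h - center_size) // 2, (w - center_size) // 2
    center = gt_patch[top:top + center_size, left:left + center_size]
    # positive only if the ground-truth boundary crosses the central box,
    # i.e. the box contains both tumor (0) and background (1) pixels
    return 1 if (center.min() == 0 and center.max() == 1) else -1
```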

2.3 Augment Positive Data

After data generation and labeling, we convert one image into a large dataset which includes around 300 positive regions and 20,000 negative regions. Because the positive training set is much smaller than the negative training set, data augmentation is necessary to improve classification performance. According to the characteristics of our data, we choose two flips (horizontal and vertical) and 35 rotations (10\(^{\circ }\), 20\(^{\circ }\), 30\(^{\circ }\), ..., 340\(^{\circ }\), 350\(^{\circ }\)). We rotate the whole original image and the ground-truth image by each angle, then apply the same extraction and labeling algorithm described above to obtain the positive patches. After this step, the number of positive training examples increases from 300 to 11,400.
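A sketch of this augmentation step is shown below; it assumes SciPy's image rotation, since the paper does not state which rotation implementation was used.

```python
import numpy as np
from scipy.ndimage import rotate

def augmented_views(image, ground_truth, angles=range(10, 360, 10)):
    """Yield (image, ground-truth) pairs for the two flips and the 35
    rotations described in Sect. 2.3; patch extraction and labeling are
    then re-run on every pair to collect additional positive regions."""
    yield np.fliplr(image), np.fliplr(ground_truth)
    yield np.flipud(image), np.flipud(ground_truth)
    for angle in angles:
        # order=0 (nearest neighbor) keeps the rotated ground truth binary
        yield (rotate(image, angle, reshape=False, order=1),
               rotate(ground_truth, angle, reshape=False, order=0))
```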

3 CNN Architecture and Model Learning

The architecture of the CNN used in this paper is illustrated in Fig. 3. This CNN has three convolution-max pooling layers followed by a 2-way softmax output layer. The CNN is configured with Rectified Linear Units (ReLUs), as they train several times faster than their equivalents with tanh units. This section details those layers. Patches of \(21\times 21\) pixels were used as the CNN input in this study; all patches are grayscale, so the input has a single channel. The first convolutional layer uses 96 kernels of size \(5\times 5\) with a stride of 4 pixels and padding of 2 pixels on the edges, followed by a \(3\times 3\) max pooling layer with a stride of 2. A Local Response Normalization (LRN) layer is applied after the first pooling layer. The second convolutional layer uses 128 filters of size \(3\times 3\) with a stride of 2 pixels and padding of 2 pixels on the edges. A second pooling layer has the same specification as the first one. The third convolutional layer uses 128 filters of size \(3\times 3\) with stride and padding of 1. The third pooling layer also has the same configuration as the two before it and leads to a softmax output layer with two labels corresponding to the ROI pixel (1) and non-ROI pixel (\(-1\)) classes.
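For illustration only, the PyTorch sketch below approximates this architecture for a \(21\times 21\) single-channel input. The original model was implemented in Caffe (Sect. 4); the padding of the second and third pooling layers and the final fully connected layer feeding the softmax are assumptions made here to keep the spatial dimensions valid.

```python
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    """Three conv/pool stages and a 2-way output (ROI vs. non-ROI),
    loosely following Fig. 3; hyper-parameters marked 'assumed' are
    not specified in, or differ slightly from, the paper."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, kernel_size=5, stride=4, padding=2),     # 21x21 -> 6x6
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                    # 6x6 -> 2x2
            nn.LocalResponseNorm(size=5),                             # LRN after pool 1
            nn.Conv2d(96, 128, kernel_size=3, stride=2, padding=2),   # 2x2 -> 2x2
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),         # assumed padding; 2x2 -> 1x1
            nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),  # 1x1 -> 1x1
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),         # assumed padding; 1x1 -> 1x1
        )
        self.classifier = nn.Linear(128, 2)   # softmax is applied in the loss

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# shape check on a dummy grayscale 21x21 patch
logits = PatchCNN()(torch.zeros(1, 1, 21, 21))   # -> shape (1, 2)
```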

Fig. 3. Illustration of the Convolutional Neural Network (CNN) architecture.

4 Experiments and Discussions

The algorithm learned the segmentation model after all regions were trained using our CNN. We tested our segmentation algorithm on different brain MRI slices. Our goal was to output a binary segmentation image of the same size, similar to the ground-truth image the doctors segmented manually. For the test images, we applied a sliding window of the same size (\(21\times 21\)) to obtain the region proposals, then forward propagated each proposal through the CNN model to determine whether its class is positive or negative. We recorded the location of each region's central pixel in the original image for constructing the binary segmentation. If a region was classified as positive, meaning the center of the region is considered an ROI pixel, the central pixel was set to 0 (black) in the segmentation output image. Otherwise, the central pixel was considered a non-ROI (background) pixel and was set to 1 (white).
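A sketch of this reconstruction step, assuming the patch centers recorded during extraction and a vector of per-region class predictions:

```python
import numpy as np

def reconstruct_segmentation(image_shape, centers, predictions):
    """Build the binary output image from per-region predictions:
    centers classified as positive (ROI) are set to 0 (black),
    everything else stays 1 (white)."""
    segmentation = np.ones(image_shape, dtype=np.uint8)
    for (row, col), pred in zip(centers, predictions):
        if pred == 1:                 # positive class -> ROI pixel
            segmentation[row, col] = 0
    return segmentation
```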

In clinical MRI applications, the transverse, coronal, and sagittal planes are the three main planes of the body used to describe the location of body parts in relation to one another. The transverse plane is a horizontal plane that divides the body into superior and inferior parts, the coronal plane is any vertical plane that divides the body into ventral and dorsal sections, and the sagittal plane is any vertical plane that divides the body into right and left halves. Scans in different planes vary significantly, so we trained a separate model for each plane in Sect. 4.1. We also experimented with creating one general model to detect tumors in images from all three scan planes. Since our model is based on deep learning, this challenge is easily addressed by extending the training dataset to cover all three cases. The results are shown in Sect. 4.2.

In general, a primary brain tumor has only one large lesion; it is usually associated with extensive local edema and is easy to detect. In contrast, a secondary brain tumor usually has several very small lesions without local edema and is hard to detect. We chose images with secondary brain tumors as our test samples to demonstrate the superiority of our method. All images chosen for the study had small brain tumors, and the tumors were not visible in all slices. Because of limited medical image resources, our experiments used MRI images from 5 patients (A–E). The patients' information is listed in Table 1. We picked the slices in which the tumor could be seen and had them labeled by experienced doctors.

Table 1. Patient information

In each SGD iteration of our training, we uniformly sample 32 positive regions and 32 negative regions to construct a minibatch of size 64. We biased the sampling towards positive regions because they are extremely rare compared to the background (negative) regions. The CNN portion of our experiments used the Caffe framework [20] running on NVIDIA Kepler-series K40 GPUs. We used Matlab to produce the final segmentation results. The CNN model presented in Fig. 3 was trained on region images over multiple passes to increase its ability to automatically detect ROI pixels in test images of varying resolution.
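A minimal sketch of this class-balanced sampling is given below, assuming the positive and negative patches are held in NumPy arrays; the actual training loop was run through Caffe's data pipeline.

```python
import numpy as np

def sample_minibatch(pos_patches, neg_patches, n_pos=32, n_neg=32, rng=None):
    """Draw 32 positive and 32 negative regions per SGD iteration,
    biasing the sampling toward the rare positive class."""
    rng = rng or np.random.default_rng()
    pos = pos_patches[rng.choice(len(pos_patches), size=n_pos)]  # with replacement
    neg = neg_patches[rng.choice(len(neg_patches), size=n_neg, replace=False)]
    batch = np.concatenate([pos, neg])
    labels = np.array([1] * n_pos + [-1] * n_neg)
    order = rng.permutation(len(labels))
    return batch[order], labels[order]
```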

4.1 Three Plane Models

We performed two experiments for each plane. In the first experiment (Tests 1, 3, and 5), we used different slices from the same patient for training and testing. For the second experiment (Tests 2, 4, and 6), we picked slices from multiple patients (excluding the testing patient) for training. Table 2 shows the details of the experimental setup. Experiment results are shown in Figs. 4 and 5. All training images are listed in Appendix A. To evaluate the performance, we compared our method with Otsu's method [21]. Test 1, shown in Fig. 4(a–d), used one slice for training. The algorithm was able to detect the tumor areas accurately, although with some noise. The segmentation result for Test 2, shown in Fig. 4(g), is almost identical to the doctors' labeling. The result of Test 2 was better than that of Test 1, mostly because more slices were used for training. The results of Tests 1 and 2 show the algorithm was able to effectively and accurately locate the tumor.

Table 2. Details of the experimental setup
Fig. 4. Transverse plane segmentation. Test 1: (a) one transverse slice from patient A as the testing image, (b) ground truth of (a), (c) our segmentation, (d) Otsu's segmentation. Test 2: (e) one transverse slice from patient B as the testing image, (f) ground truth of (e), (g) our segmentation, (h) Otsu's segmentation.

Fig. 5. Coronal and sagittal plane segmentations. (a) Testing image in Tests 3 and 4, (b) segmentation of (a) labeled by doctors, (c) our segmentation in Test 3, (d) our segmentation in Test 4, (e) Otsu's segmentation of (a), (f) testing image in Tests 5 and 6, (g) segmentation of (f) labeled by doctors, (h) our segmentation in Test 5, (i) our segmentation in Test 6, (j) Otsu's segmentation of (f).

Compared with Otsu's method, our method was able to distinguish boundary pixels of the skull from tumor pixels, whereas Otsu's method failed to differentiate the tumor area from the skull boundary. We therefore applied morphological post-processing to remove the skull boundary from its results for comparison ('Otsu-p'), as shown in Table 3. Since we chose a \(3\times 3\) central window, our raw result is slightly dilated. For a fair comparison, we also applied a simple erosion to our raw result, where 'Ours-1' means the erosion operation was applied once and 'Ours-2' means it was applied twice. The comparison is presented in Table 3.
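A sketch of this erosion step using SciPy, assuming the output convention of Sect. 4 (ROI pixels are 0, background is 1):

```python
import numpy as np
from scipy.ndimage import binary_erosion

def erode_roi(segmentation, iterations=1):
    """Shrink the detected ROI (0 = black) by one pixel per iteration to
    compensate for the dilation introduced by the 3x3 central window;
    iterations=1 corresponds to 'Ours-1' and iterations=2 to 'Ours-2'."""
    roi = binary_erosion(segmentation == 0, iterations=iterations)
    return np.where(roi, 0, 1).astype(segmentation.dtype)
```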

Tests 3 and 4 used the same test image, Fig. 5(a), and both located the tumor with the high recall scores listed in Table 3. However, for Test 4, there were only two patients whose tumors could be seen in the coronal plane scans. Since the training data were scarce and quite different from the test image, the result of Test 4, presented in Fig. 5(d), showed some contour noise, which can be removed by simple post-processing techniques; for example, 'Ours-2' boosted the precision score from 0.29 to 0.64. Tests 5 and 6 show that our method has a strong response to the two tumor areas, outperforming Otsu's method. Test 6 has better recall and precision scores than Test 5 because it used more training images. Better accuracy could be obtained if more training data were available.

Table 3. Comparison results in Test 1–6

As shown in Table 3, the proposed segmentation algorithm has the best recall score in every test except Test 2, where Otsu's method with post-processing performed better, although its precision is quite low. In terms of precision, our method with simple post-processing performed the best.
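The scores can be computed pixel-wise against the doctors' ground truth; a sketch, assuming the 0 = ROI / 1 = background convention and pixel-wise counting (the paper does not state how the scores were computed):

```python
import numpy as np

def recall_precision(predicted, ground_truth):
    """Pixel-wise recall and precision of the ROI class (0 = ROI)."""
    pred_roi = predicted == 0
    true_roi = ground_truth == 0
    tp = np.logical_and(pred_roi, true_roi).sum()
    recall = tp / max(true_roi.sum(), 1)
    precision = tp / max(pred_roi.sum(), 1)
    return recall, precision
```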

From the perspective of running speed, one pass of our CNN model takes close to 1 ms, and each pass can be processed independently to take advantage of parallelism. In contrast, Otsu's method takes about one second and its computation cannot be parallelized. Our method therefore has great potential to be implemented in hardware for real-time segmentation.

4.2 General Model

We selected one slice from each of the three scan planes of patient B as the test images for Test 7. The general model was trained using one slice per scan plane, with visible tumor areas, from each of the other patients. The results of this experiment are shown in Fig. 6. We also compared our results with Otsu's method, as shown in Table 4. We observed noisier segmentation using the general model than using a specialized model, but the general model was able to segment the tumor in all three planes. Table 4 shows that our methods have the best recall and precision scores.

Fig. 6. Test 7 results: (a–c) transverse plane, (d–f) coronal plane, (g–i) sagittal plane.

Table 4. Comparison results in Test 7.

5 Conclusion

In this paper, we propose to use a Convolutional Neural Network for brain MRI image segmentation. We train a CNN with ROIs and non-ROIs patterns iteratively so that it is able to automatically segment tumor areas effectively. Our result is very promising. Since all features are learned from labeled data, our model is able to accurately locate the tumor areas.

Our motivation for this work is not to replace any existing well-known segmentation methods. This work proposes an interesting concept that could be improved further. Since all important information for segmentation is learned from the data labeled by experts, our model has demonstrated its capability of mimicking the expert segmentation style represented in the ground truth. Other segmentation methods require manually fine-tuning parameters for different applications. An advantage of the proposed method is its flexibility and potential to adapt to different applications or imaging modalities without any modifications to the algorithm. Unlike traditional deep learning methods that require large-scale training data, which is often not feasible for medical image applications, this algorithm requires only a small set of training images and the corresponding ground truth.

The proposed method trains the CNN model with only a couple of images. Training with more images will further improve its performance. Because our dataset size is small, starting with a pre-trained model would also improve its performance. Meanwhile, different experimental settings might change the performance, which needs to be investigated in the future.