1 Introduction

Lower back pain is one of the most common neurological ailments in the United States, and the costs associated with it form a significant portion of the total annual spending on healthcare. Degeneration of intervertebral discs (IVDs), caused by aging, trauma, mechanical load, or nutritional or genetic factors, is a common underlying cause of lower back pain. The degree of degeneration is typically assessed by means of magnetic resonance imaging (MRI), given the superior ability of MRI to distinguish between soft tissues and its absence of ionizing radiation. Because of significant inter-observer variation in grading IVD degeneration and constantly increasing workloads for radiologists, more and more research is devoted to developing computer-assisted diagnosis systems that support radiologists in their work and thereby improve the diagnostic process [1]. A crucial initial step in such a system is an accurate segmentation of the IVDs.

Over the last years, a number of different methods have been proposed for segmentation of IVDs [16]. Reported performance ranges from a mediocre mean DICE score of 74 % to an impressive 92 %. However, a drawback of some of the referenced methods is their limitation to two-dimensional (2D) image data, typically the mid-sagittal slice of an image volume covering the lumbar spine. In addition, no comparison between methods has been possible thus far, since all have been evaluated on different data sets. To promote three-dimensional (3D) segmentation of IVDs and to enable a valid comparison of different methods, an IVD segmentation challenge was organized in conjunction with the 3rd MICCAI Workshop & Challenge on Computational Methods and Clinical Applications for Spine Imaging (MICCAI–CSI2015).

In this paper, we present one of the methods that participated in said challenge. The presented method is capable of both localizing and segmenting IVDs in MRI data. It builds upon earlier work by Lootus et al. [7] on vertebral body detection and labeling in MRI data and by Forsberg [8] on multi-atlas based segmentation of vertebrae in computed tomography data. The two approaches are combined and adapted to the task of localization and segmentation of IVDs in MRI data. Results are presented and discussed for the training and test data provided for the challenge.

2 Materials and Methods

2.1 Image Data

The image data provided for training/testing and initial/final evaluation consisted of MRI data from 15 and 10 subjects, respectively. Each subject had been scanned with a 1.5 T scanner (Siemens Magnetom Sonata, Siemens Healthcare, Erlangen, Germany), and the data form a subset of the data used in the work of Chen et al. [2]. The image data consisted of sagittal T2-weighted turbo spin echo image volumes with a spatial resolution of \(2.00\,{\times }\,1.25\,{\times }\,1.25\) mm\(^3\) and a size of \(39\,{\times }\,305\,{\times }\,305\) or \(48\,{\times }\,304\,{\times }\,304\) voxels. The IVDs were manually segmented using the original sagittal images. Examples of the image data and ground truth segmentations are given in Fig. 1.

Fig. 1. Example of data used for the evaluation along with ground truth segmentations.

2.2 Segmentation Pipeline

The proposed segmentation pipeline consists of three components: detection and labeling of vertebral bodies, multi-atlas based segmentation per IVD, and label fusion. Note that the presented method is 3D-based and provides an accurate segmentation in 3D, even though the initial detection and labeling step is performed on individual 2D images.

2.3 Detection and Labeling

Detection and labeling of vertebrae is a challenge of its own, and a number of methods have been presented in recent years. In our pipeline, it was decided to mimic the approach presented by Lootus et al. [7]. However, we employed aggregated channel features [9] coupled with an AdaBoost classifier for detection of the individual vertebrae. The reasons for not choosing the deformable parts model based on histograms of oriented gradients, as employed by Lootus et al. [7], were twofold. Firstly, it did not perform as well as the chosen approach, and secondly, it was more computationally demanding. Similar to the work of Lootus et al. [7], two different detectors were trained: one general vertebra detector and one for the fused S1 and S2 segments of the sacrum. To remove a significant portion of the false positive detections, a greedy non-maxima suppression algorithm was employed. To improve the performance of the detector by increasing the number of detections, the three mid-sagittal slices were fed as input to the vertebra detector. The individually detected objects were then combined using a pictorial structures model [10], which further removed false positive detections and labeled the detected vertebrae. The vertebra detectors and the graphical parts model had previously been trained on a separate data set. Figure 2 provides a few examples depicting the output from the detection and labeling step.
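The greedy non-maxima suppression step mentioned above can be illustrated with a minimal sketch. The overlap criterion and threshold are not stated in this paper, so the intersection-over-union measure, the threshold of 0.5, the box layout and the function names `iou` and `greedy_nms` are all assumptions made for illustration only.

```python
def iou(a, b):
    """Intersection over union of two 2D boxes given as (y0, z0, y1, z1)."""
    overlap_y = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_z = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_y * overlap_z
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(detections, max_overlap=0.5):
    """Keep the highest-scoring detections that do not overlap too much.

    detections: list of (score, box) pairs. The threshold max_overlap
    is an assumed value, not one taken from the paper.
    """
    kept = []
    # Visit detections from highest to lowest score.
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a detection only if it overlaps little with all kept ones.
        if all(iou(box, kept_box) <= max_overlap for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

In this sketch, a lower-scoring box is discarded as soon as it overlaps a previously kept box by more than the threshold, which is what makes the procedure greedy.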

The output from the detection and labeling consisted of a pair of y and z coordinates along with a label for each detected and labeled vertebra (assuming a coordinate system where x goes from right to left, y anterior to posterior and z inferior to superior). Corresponding x coordinates for each detection and labeling were simply set to the x coordinate of the mid-sagittal slice in the data set. Note that this works well as long as the orientation of the image volume is parallel with the spinal column and the subject has limited spinal deformities in the coronal plane.

Fig. 2. Example results from the detection and labeling of vertebrae. The detection and labeling work well even in cases where the sacrum is not fully depicted (c).

2.4 Multi-atlas Based Segmentation

The output from the previous step provided a set of landmarks denoting the centerpoints of vertebrae T11 to L5 along with the centerpoint of sacrum S1-S2. These centerpoints were used to provide rough estimates of the centerpoints of the corresponding IVDs, simply by using the midpoint between the centerpoints of two consecutive vertebrae. This served as input to the image registration, in which a registration per disc was performed by extracting a sub-block (sized \(40\,{\times }\,96\,{\times }\,96\) voxels, determined empirically to ensure a good coverage of each disc) of the data from each volume, centered around the respective centerpoint of each disc. Note that for a general segmentation pipeline, the size of the sub-blocks should preferably be set in millimeters, possibly scaled depending on the sex, age and height of the patient. Each disc of the image data to segment was registered with multiple atlases. The registration was executed in two steps: an initial affine registration was performed to account for differences in size and pose, and a subsequent deformable registration was applied to account for local differences in shape. In both cases, local phase-based image registration approaches were applied.
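The centerpoint estimation and sub-block extraction described above can be sketched as follows. This is a minimal illustration and not the implementation used for the challenge: it assumes voxel coordinates ordered as (x, y, z), and the zero-padding at the volume border is an assumption (some handling is needed, since the 40-voxel block is wider than a 39-slice volume). The function names are hypothetical.

```python
import numpy as np


def disc_center_estimates(vertebra_centers):
    """Midpoints between consecutive vertebra centerpoints.

    vertebra_centers: (N, 3) array ordered along the spine,
    e.g. T11, T12, L1, ..., L5, S1-S2. Returns an (N - 1, 3) array.
    """
    c = np.asarray(vertebra_centers, dtype=float)
    return 0.5 * (c[:-1] + c[1:])


def extract_sub_block(volume, center, size=(40, 96, 96)):
    """Cut a block of `size` voxels centered on `center`.

    Voxels outside the volume are filled with zeros (an assumption).
    """
    size = np.asarray(size)
    block = np.zeros(tuple(size), dtype=volume.dtype)
    start = np.round(center).astype(int) - size // 2
    # Part of the requested block that lies inside the volume.
    src_lo = np.maximum(start, 0)
    src_hi = np.minimum(start + size, volume.shape)
    if np.all(src_hi > src_lo):
        dst_lo = src_lo - start
        dst_hi = dst_lo + (src_hi - src_lo)
        src = tuple(slice(lo, hi) for lo, hi in zip(src_lo, src_hi))
        dst = tuple(slice(lo, hi) for lo, hi in zip(dst_lo, dst_hi))
        block[dst] = volume[src]
    return block
```

Eight vertebra landmarks (T11 to L5 plus S1-S2) give seven disc centerpoint estimates, one per sub-block to register.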

Affine Registration. The affine phase-based registration was defined as an \(L_2\) norm:

$$\begin{aligned} \epsilon ^2 = \frac{1}{2}\sum _k {\sum _\mathbf{x \in \varOmega } {c_{k}(\mathbf x ) \left[ \nabla \varphi _k{(\mathbf x )}^T B(\mathbf x ) \mathbf p - \varDelta \varphi _k (\mathbf x ) \right] ^2}}, \end{aligned}$$
(1)

where \( \varphi _k \) refers to the local phase-difference in orientation \({\varvec{\hat{\mathbf{n }}}}_k\) between the two images to be registered, \(c_k\) is a measure of certainty related to \(\varphi _k\), and \(B(\mathbf x )\mathbf p \) corresponds to a linear parameterization of the local displacement \(\mathbf d (\mathbf x )\). A more detailed description, including details on the employed graphics processing unit (GPU) implementation, is found in the works of Hemmendorff et al. [11] and Eklund et al. [12].
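Because Eq. 1 is quadratic in \(\mathbf p \), setting its derivative with respect to \(\mathbf p \) to zero gives a linear system that can be solved directly. The sketch below illustrates this on synthetic samples. It reads \(\nabla \varphi _k\) as a local phase gradient and \(\varDelta \varphi _k\) as a local phase difference, as the form of Eq. 1 suggests, and it assumes a 12-parameter 3D affine basis with a particular parameter ordering. These choices, like the function names, are assumptions for illustration and may differ from the cited implementations.

```python
import numpy as np


def affine_basis(x):
    """Return the 3x12 basis B(x) with B(x) @ p = A @ x + t.

    Assumed layout of p: the nine entries of A (row-major), then t.
    """
    B = np.zeros((3, 12))
    for row in range(3):
        B[row, 3 * row:3 * row + 3] = x
        B[row, 9 + row] = 1.0
    return B


def solve_affine_parameters(points, phase_gradients, phase_differences,
                            certainties):
    """Minimize Eq. 1 over p by solving the normal equations M p = h.

    Each sample is one (orientation k, position x) pair, so the double
    sum in Eq. 1 is flattened into a single loop.
    """
    M = np.zeros((12, 12))
    h = np.zeros(12)
    for x, grad, dphi, c in zip(points, phase_gradients,
                                phase_differences, certainties):
        v = affine_basis(x).T @ grad  # B(x)^T grad(phi), a 12-vector
        M += c * np.outer(v, v)
        h += c * dphi * v
    return np.linalg.solve(M, h)
```

With noise-free synthetic data generated from a known parameter vector, this solve recovers that vector, which is a convenient sanity check for the derivation.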

Fig. 3. Example results from the atlas-based registration. The individual segmentations provided by the atlas-based segmentation are highly irregular and far from perfect; note, for example, the top disc in (a).

Deformable Registration. As for the affine registration, a voxel-wise \(L_2\) norm based upon local phase-differences was defined for the deformable counterpart:

$$\begin{aligned} \epsilon ^2(\mathbf x ) = \sum _{k} \left[ c_{k}(\mathbf x ) \mathbf T (\mathbf x ) \left( \varphi _{k}(\mathbf x ) \varvec{\hat{\mathbf{n }}}_{k} - \mathbf u (\mathbf x ) \right) \right] ^{2}. \end{aligned}$$
(2)

In this case, \(\varphi _k\) and \(c_k\) are as before, and \(\mathbf T \) refers to the local structure tensor.

Solving for \(\mathbf u \) provides a voxel-wise update field \(\mathbf u (\mathbf x )\), which can be iteratively regularized and added to the final displacement field \(\mathbf d \). Details on the registration algorithm can be found in the work of Knutsson and Andersson [13], and in the works of Forsberg et al. [14, 15] for the employed GPU implementation.

Examples of output from the whole atlas-based segmentation step are shown in Fig. 3.

2.5 Label Fusion

The final step is to merge the labels of the multiple deformed atlases into a single label volume. A modified majority voting was employed for label fusion, where, instead of a standard majority vote, only a minimum number of votes was required to render a valid segmentation. The reason for this approach is that, since the discs are well separated, only the background provides a competing label. The minimum number of votes required for a segmentation was set to five, a number which was determined empirically. Example visualizations of the final segmentations are given in Fig. 4.
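The modified voting rule can be written in a few lines. The sketch below assumes that the deformed atlas labels for one disc are available as a stack of binary masks. The array layout and the function name `fuse_labels` are assumptions, while the threshold of five votes is the value stated above.

```python
import numpy as np


def fuse_labels(atlas_masks, min_votes=5):
    """Fuse binary atlas masks with a fixed minimum number of votes.

    atlas_masks: array of shape (n_atlases, X, Y, Z), one deformed
    binary mask per atlas for a single disc. A voxel is labeled as disc
    if at least `min_votes` atlases agree.
    """
    votes = np.sum(np.asarray(atlas_masks, dtype=bool), axis=0)
    return votes >= min_votes
```

For comparison, a strict majority vote over 14 atlases would require eight votes, so a threshold of five is more permissive toward the disc label.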

Fig. 4. Example results from the final step of label fusion. Previous irregularities are now gone and no apparent errors in the final segmentations are visible.

2.6 Evaluation

Given that 15 data sets were available for training, including ground truth data, a leave-one-out evaluation was performed on the training data, in which one data set was segmented using the 14 others as atlases. This was repeated for each available data set. Both IVD localization and segmentation results were included in the evaluation. For the evaluation on the test data, all 15 data sets were used as atlases.
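The leave-one-out scheme amounts to the following loop structure, shown here as a minimal sketch. The function name and the use of plain subject identifiers are assumptions for illustration.

```python
def leave_one_out_splits(subject_ids):
    """Yield (target, atlases) pairs for a leave-one-out evaluation.

    Each subject is the target exactly once, with all remaining
    subjects serving as atlases.
    """
    subject_ids = list(subject_ids)
    for i, target in enumerate(subject_ids):
        atlases = subject_ids[:i] + subject_ids[i + 1:]
        yield target, atlases
```

With 15 training subjects this yields 15 splits, each with 14 atlases.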

The evaluation of the test data was performed in a two-step process, with five data sets released prior to the challenge for an off-site evaluation and the remaining five data sets released on the day of the challenge for an on-site evaluation. The challenge organizers only provided results in terms of mean disc centroid distance and DICE score.

Localization. For each segmented disc, the disc centroid distance was computed as the Euclidean distance between the centroid of the ground truth IVD and the centroid of the segmented IVD obtained from the presented method. A disc localization was considered successful if the disc centroid distance was less than 2 mm.
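A minimal sketch of the localization measure is given below. It assumes binary masks indexed as (x, y, z) and uses the voxel spacing of \(2.00\,{\times }\,1.25\,{\times }\,1.25\) mm\(^3\) stated in Sect. 2.1 to convert voxel indices to millimeters. The function names are hypothetical, and the masks are assumed to be non-empty.

```python
import numpy as np


def centroid_mm(mask, spacing=(2.0, 1.25, 1.25)):
    """Centroid of a non-empty binary mask in millimeters."""
    voxel_indices = np.argwhere(mask)
    return voxel_indices.mean(axis=0) * np.asarray(spacing)


def centroid_distance_mm(gt, seg, spacing=(2.0, 1.25, 1.25)):
    """Euclidean distance between ground truth and segmented centroids."""
    diff = centroid_mm(gt, spacing) - centroid_mm(seg, spacing)
    return float(np.linalg.norm(diff))


def is_localized(gt, seg, threshold_mm=2.0):
    """A localization counts as successful below the 2 mm threshold."""
    return centroid_distance_mm(gt, seg) < threshold_mm
```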

Segmentation. The ground truth data was compared with the segmentations obtained from the multi-atlas based segmentation using the DICE score. The DICE score is defined as:

$$\begin{aligned} DICE = \frac{2*|GT \cap S|}{|GT|+|S|}, \end{aligned}$$
(3)

where GT and S refer to the ground truth and the computed segmentations respectively, and \(|\ldots |\) denotes the number of voxels, i.e. the anisotropic resolution was not taken into account.

Table 1. Average DICE score, and false negative (FN) and false positive (FP) ratios per disc for the training data.
Table 2. Average DICE score, and false negative (FN) and false positive (FP) ratios per subject for the training data.

To complement the agreement measure of the DICE score, false negative (FN) and false positive (FP) ratios were also computed as:

$$\begin{aligned} FN = \frac{{|GT \setminus S|}}{{|GT|}} \end{aligned}$$
(4)

and

$$\begin{aligned} FP = \frac{{|S \setminus GT|}}{{|S|}}. \end{aligned}$$
(5)
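Equations 3 to 5 translate directly into voxel counts. The sketch below assumes binary masks of equal shape with non-empty ground truth and segmentation. The function name `overlap_metrics` is hypothetical.

```python
import numpy as np


def overlap_metrics(gt, seg):
    """Return (DICE, FN, FP) as defined in Eqs. 3 to 5.

    All quantities are voxel counts, so the anisotropic voxel size
    plays no role. Both masks are assumed to be non-empty.
    """
    gt = np.asarray(gt, dtype=bool)
    seg = np.asarray(seg, dtype=bool)
    n_inter = np.count_nonzero(gt & seg)
    n_gt = np.count_nonzero(gt)
    n_seg = np.count_nonzero(seg)
    dice = 2.0 * n_inter / (n_gt + n_seg)
    fn = (n_gt - n_inter) / n_gt    # |GT \ S| / |GT|
    fp = (n_seg - n_inter) / n_seg  # |S \ GT| / |S|
    return dice, fn, fp
```

For example, a ground truth of four voxels and a segmentation of four voxels that share three voxels give a DICE score of 0.75 and FN and FP ratios of 0.25 each.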

3 Results

3.1 Training Data

The mean disc centroid distance was \(0.86\,{\pm }\,0.45\) mm. Only three disc centroid distances were larger than 2 mm (2.05, 2.91 and 2.06 mm), giving an IVD detection rate of 97 %.
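As a check on this figure, and assuming seven IVDs per subject (one between each pair of the eight landmarks T11 to S1-S2 described in Sect. 2.4, a count that is not stated explicitly here), the detection rate for the 15 training subjects follows as:

$$\begin{aligned} \frac{15 \cdot 7 - 3}{15 \cdot 7} = \frac{102}{105} \approx 0.97. \end{aligned}$$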

The average achieved DICE score was \(0.91\,{\pm }\,0.01\) along with an average FN and FP of \(0.08\,{\pm }\,0.02\) and \(0.09\,{\pm }\,0.02\) respectively. Detailed results per disc and subject are given in Tables 1 and 2 respectively.

3.2 Test Data

For the off-site evaluation, the mean disc centroid distance was \(0.81\,{\pm }\,0.42\) mm and the detection rate was 97 %. The corresponding results for the on-site evaluation were \(0.99\,{\pm }\,0.78\) mm and a detection rate of 80 %. The achieved DICE score was \(0.90\,{\pm }\,0.03\) for both the off-site and on-site evaluation.

4 Discussion

In this paper, we have presented one of the methods participating in the IVD segmentation challenge organized in conjunction with MICCAI–CSI2015, a method relevant for both robust IVD detection and accurate IVD segmentation. The method has been evaluated using training and test data provided by the challenge organizers. Performance was assessed using the disc centroid distance, the DICE score, and the ratios of false negatives and false positives (the latter two only computed for the training data).

The presented method achieved a mean disc centroid distance of \(0.86\,{\pm }\,0.45\) mm, with a success rate of 97 % given a threshold of 2 mm for the training data. This can be compared with a mean disc centroid distance of 2.08 mm as reported by Ghosh and Chaudhary [3] for the 2D case, 1.23 mm by Law et al. [4] and \(1.6\,{-}\,2.0\) mm by Chen et al. [2] (both of the latter for the 3D case). As such, the presented results compare favorably with earlier results, bearing in mind that these were obtained on different data sets. Note that the results obtained for the on-site evaluation were somewhat worse than those for the training data and the off-site evaluation.

In terms of segmentation accuracy, the presented method performs on par with current state-of-the-art methods for IVD segmentation. For example, Michopoulou et al. [5] achieved an impressive mean DICE score of 0.92, although only for segmentation of 2D image data and using manual interaction for performing the initial atlas-based registration. Similar DICE scores were presented in the work of Law et al. [4], but again only for 2D image data. Neubert et al. [16] presented an extension of their initial work [6] using a multi-level statistical model and achieved a mean DICE score of 0.91 on 3D data.

In Tables 1 and 2 it can be noted that the segmentation accuracy is stable over both discs and subjects, i.e. there are no failed cases, and any future improvements are thus more a matter of fine-tuning parameters than of making major changes to the presented pipeline. The ratios of false negatives and false positives show that there appears to be an equal distribution of under- and over-segmentation between discs and subjects.

The presented results are limited by the small size of the data set employed for the evaluation and by its homogeneity. Further, the data set lacks cases of degenerated IVDs. Hence, it is difficult to foresee the performance of the presented segmentation pipeline on more clinically relevant data, including a variety of degenerated IVDs. Another limitation is the dependence of the registration step on the detection and labeling step. A missed vertebra in between other vertebrae can be handled with some additional heuristics that account for a long distance between detected vertebrae. In case the detection and labeling is off by one or two labels, i.e. in the case when S1-S2 is missed and instead L5-S1 is labeled as S1-S2, the segmentation is still expected to perform well but neglects to segment the most inferior IVD.

In all, the evaluation results of the presented method, a mean disc centroid distance of \(0.86\,{\pm }\,0.45\) mm and an average DICE score of 0.91 for the training data, show that robust localization and accurate segmentation of IVDs are achievable.