Keywords

1 Introduction

The task dealt with in this paper is the following: given an MR of the lumbar spine, localize and label all the vertebrae present in that image. The motivation for this work is that spine appearance, shape and geometry measurements are necessary for abnormality detection locally at each disk [27] and vertebrae [8, 9] (such as herniation), as well as globally for the whole spine (such as spinal scoliosis).

Fig. 1
figure 1

The task. Given a 3D MR lumbar spine image comprising of a stack of sagittal 2D slices as input (the mid-slice is shown on the left), localize and label in that image all the vertebrae that are present. The output (projected on the mid-slice on the right) consists of labelled tight bounding boxes around the vertebrae. Note that all the 2D slices in the 3D slice stack are searched for vertebrae candidates

In more detail, the input 3D image is a (sparsely spaced) stack of 2D sagittal images, and the output consists of labelled tight bounding boxes with labels around all the vertebrae in the image. Each bounding box is specified by its position, orientation, and scale. An example of the detection and labelling for a typical normal scan is shown in Fig. 1.

This detection task is challenging for a number of reasons, including: (1) the repetitive nature of the vertebrae, (2) varying image resolution and imaging protocols; artefacts, and (3) large anatomical and pathological variation, particularly in the lumbar spine. Various examples of challenging cases in our dataset are highlighted in Fig. 2. The anatomy and pathology variation can affect both the local vertebrae / disks appearance (e.g. degraded disks—Fig. 2h), and the global layout of the spine (e.g. scoliosis—Fig. 2c).

Fig. 2
figure 2

Spine variation in our data. A collection of example images showing assorted image, anatomical and pathological modes of global variation of the spine shape, and local variation of the vertebrae, and the disks. Our algorithm is robust to all those variations. Abnormalities have been highlighted by the red arrows. a Normal spine with a zoom on a normal vertebra. b A low-resolution image. c A coronal view of a scoliotic spine, resulting in the spine not being cut by a single sagittal slice. d Top a normal sacrum, with unambiguous L5, S1 labelling based on shape and S1 and L5 orientation. Bottom a sacrum with ambiguous L5, S1 labelling based on their shape and orientation. e Joined vertebrae. fj Pathologically deformed vertebrae and disks. k Magnetic susceptibility imaging artefacts

Contributions. Our method brings together two strong algorithms—the Deformable Part Model of Felzenszwalb et al. [1] based on Histogram of Oriented Gradients (HOG) image descriptors [10] and efficient inference on graphical models [11, 12]—making the algorithm accurate, robust, and efficient on challenging spine datasets. The algorithm is also tolerant to varying MR acquisition protocols, image resolutions, patient position, and varying slice spacing unlike related solutions in the literature. It localizes all the vertebrae present in a scan, and labels them correctly as long as the sacrum is present in the scan. Importantly, the method is appliccable to standard MRI protocols.

The method has two distinct stages. First, vertebrae candidates are detected by using a sliding window detector searching over position, scale, and angle (Sect. 2.1). Second, a graphical model is fitted to the set of candidate detections to find the optimal spine layout and labelling based on the unary SVM score of the detection for each part, and a spatial cost between each pair of connected parts (Sect. 2.2). The HOG descriptor captures the near rectangular shape of the vertebrae. We detect vertebrae rather than disks since the vertebrae shape is more consistent than the disk shape as the lumbar spine studies are more often aimed at targeting disk deformations, and more suitable to be modelled with HOG. Disk locations can easily be found after detecting vertebrae.

The closest previous work to ours is that of Oktay and Akgul [13]. They detect disks and vertebrae in the lumbar spine using a Pyramid HOG descriptor; however, they only detect six disks and vertebrae with their graphical model, require the existence of both T1 and T2 scans to first detect the spinal cord, and they have a separate HOG template for each vertebrae. In contrast, we demonstrate that just one generic vertebrae detector suffices for all vertebrae, and only require the T2 scan. Furthermore, they only use the mid-sagittal slices, making it only applicable to cases where all the spine parts are in the mid-sagittal slice, whereas we sequentially search for vertebrae in all the 2D images in the 3D stack (not restricted to the mid-sagittal slice).

Ghosh et al. [14] also use HOG features [10], however they do not label the vertebrae and make strong use of heuristics and information from complementary axial scans. They detect disks rather than vertebrae. Zhan et al. [15] present a robust hierarchical algorithm to detect and label arbitrary numbers of vertebrae and disks in nearly arbitrary field of view scans, as long as one of four ‘anchor’ vertebrae (C2, T1, L1, S) are present. They first detect the ‘anchor’ vertebrae, and then other ‘bundle’ vertebrae connected to it graphically. Although the method works very well within its domain, it requires isotropic 2.1 mm resolution scans which limits its applicability severely. Our method is not limited to this domain and, in particular, does not require the high isotropic resolution.

A further extensive body of literature on spine localization and labelling exists. In almost all the papers, the algorithms work in two stages. First, some anatomical parts characteristic of the spine are detected (vertebrae [1618]/disks [14, 1921]/both [13, 15]). Second, a spine layout model is fitted to the candidates to determine the best hypothesis for the spine layout. The spatial configuration of the spine parts, and in some cases also their individual characteristics [15, 18, 22], are taken into account to both label the disks and/or vertebra, and localize the spine.

2 Method

We present a method to localize and label vertebrae in lumbar MR images using two HOG-based detectors and a graphical model. First, given a stack of sagittal MR slices, vertebrae and sacrum candidates are detected using latent SVM on HOG in each slice as described in Sect. 2.1. Next, after local non-maxima suppression, the vertebrae candidates corresponding to the spine are picked and labelled by fitting a graphical model, as explained in Sect. 2.2.

2.1 Spine Part Detection

The spine part (vertebrae) detection is implemented using two detectors constituting latents SVMs on Histogram of Oriented Gradients (HOG) descriptors [10] using the Felzenszwalb VOC Challenge object detection framework [23]. We learn one generic 2D detector for vertebrae bodies (VBs), trained on all VBs in all the training images, and another more specific 2D detector for the sacrum, trained on the VBs of the first two links of the sacrum. Both the models are visualized along with a set of training samples in Fig. 3.

Fig. 3
figure 3

The appearance model. Some training examples and a learned HOG template are shown for both the generic vertebrae body detector (a) and for the sacrum detector (b). The examples have been hand-annotated with tight Ground Truth bounding boxes as shown above and explained in Fig. 5

Training. Both the generic vertebrae body (VB) detector and the sacrum detector are trained using the Felzenszwalb detection framework [23]. The positive training examples for the VB detector are tight bounding boxes around the vertebral bodies of T10...L5 vertebrae with the bounding box sides parallel to the vertebral facets as shown in Fig. 3a. The positive training examples for the sacrum detector are tight bounding boxes around the first two links of the sacrum, with one side parallel to the posterior side of the sacrum as shown in Fig. 3b. The bounding boxes for both the VB and the sacrum are defined by fitting a minimum bounding rectangle to landmarks on them—four for the VB and eight for the sacrum. Each training sample is extracted from the slice cutting through the middle of the respective vertebral body.

For the VB detector, four HOG templates are trained, each of them of a different aspect ratio. The HOG templates are each 6 cells high, and 6, 7, 8, and 9 cells wide, corresponding to aspect ratios between 1 and 1.5. The HOG cell size for the VB model is \(8 \times 8\) pixels. The HOG template for the sacrum detector is 9 cells high by 5 cells wide, with \(8 \times 8\) pixel HOG cell size. The HOG feature vectors are 31-dimensional, with 18 contrast-sensitive, 9 contrast-insensitive direction bins; and 4 texture feature bins.

The HOG templates capture the rectangular shape of the vertebrae, with variations due to deformation, and the trapezoid shape of the first two links of the sacrum. The vertebrae show wide size and resolution variation and are all scaled and warped to match one of the aspect ratios at training. The model is learned iteratively in several steps, with new positive samples mined by running the detector on the positive samples, collecting the strongest detections as new positives, and training a new detector using the new positives.

The negative samples for the vertebrae detector are first picked randomly from mid-slices with a hand-drawn black polygon covering all the vertebral bodies. Next, an iterative learning procedure is employed to pick hard negatives as false positive detections on the negative training images as detailed in [23].

Testing. During the candidate detection step at test time, a previously unseen sagittal scan is taken as input, and tight bounding boxes around vertebrae candidates are returned as output. The candidate search is performed sequentially in all slices of the scan. The VB and sacrum detector are run on each slice of the scan, searching over position, scale, and angle, with the scan rotated by \(-20^{\circ }\) to \(20 ^{\circ }\) in \(10^{\circ }\) increments. A feature pyramid is calculated for each angle, with HOG cells placed densely next to each other. The feature pyramid has 10 levels per doubling of resolution (10 levels per octave), with the image resized and resampled to 2\(\times \) the original size to 0.5\(\times \) the original size from the finest to coarsest scale. All the detections at all positions, scales, orientations are collected and transformed onto the original test image coordinate system as shown in Fig. 4.

Fig. 4
figure 4

Vertebrae detection pipeline. a Input image. b All detections at all rotation angles and scales. The green rectangles are generic vertebrae, and the red rectangles are sacrum candidates. c All detections, with top detections shown in thick blue line, and the “\(+\)” mark the ground truth vertebrae center locations. d Output detection bounding boxes along with the ground truths and labels

A greedy non-maxima suppression algorithm is employed to remove most of the false positive detections in each slice as follows. First, the top-scoring bounding box is retained, and all bounding boxes overlapping it more than 50 % are discarded. Next, the second-highest scoring remaining bounding box is retained, and the discarding and retention process continues until all the remaining bounding boxes have at most 50 % overlap.

Next, the remaining bounding boxes from all the slices are collected and projected onto a single slice, and the non-maxima suppression process is repeated to end up with bounding boxes across all the slices that have at most 50 % overlap. These bounding boxes are next passed as input to the Graphical Model as described in Sect. 2.2 in order to eliminate any remaining false positives, and to label the vertebrae.

2.2 Graphical Model for Spine Layout

We train a parts-based graphical model [12] connecting the vertebrae in a chain. The graphical model takes as input the detections after non-maxima suppression described in the previous subsection, and gives as output the placement and labels of all vertebrae in the image. The method deals with multiple slices by ignoring the slice index in inference. The spine layout is given as a configuration \(L = (l_1,l_2,\ldots ,l_{n-1},l_{n})\) where \(l_i\) are the vertebra locations, with \(l_1\) the C1 and \(l_n=l_{25}\) the sacrum. The optimal configuration \(L^*\) of the graphical model is

$$\begin{aligned} L^{*} = \arg \min _{L} \left( \sum _{i=1}^{n} m_i\left( l_i\right) + \sum _{v_{i,j} \in G} d_{ij}\left( l_i,l_j \right) \right) \end{aligned}$$
(1)

where \(l_i\) and \(l_j\) denote the vertebrae locations \(l = (x_i,y_i,height_i,width_i,\theta _i)\) given by their location \((x,y)\), size \((height, width)\), and orientation \(\theta _i\). The best model fit minimizes the sum of the unary appearance mismatch terms \(m_i\) from the part detectors output and the spatial deformation cost \(d_{ij}\) for connected pairs \(ij\) of parts, laid at \(l_i\) and \(l_j\) respectively. The last appearance term value \(m_{25}\) comes from the sacrum detector, and the rest of the appearance term values come from the universal vertebra detector. The spatial deformation cost is a sum of box functions on \(x\) and \(y\) coordinates, the ratio of vertebrae areas \(A_i\) & \(A_j\), and the angle between the vertebrae:

$$\begin{aligned} d_{ij}(l_i,l_j) = S(A_i/A_{j}) + T(x_i-x_{j}) + U(y_i-y_{j}) + V(\theta _i - \theta _{j}) \end{aligned}$$
(2)

Here \(S\), \(T\), \(U\), and \(V\) are the box functions on area \(A\), position x & y, and angle \(\theta \) that take a low constant value if their arguments are within favourable range of each other and a higher constant value if their arguments are outside that range.

To speed up the fitting process, a Viterbi message passing scheme from [12] for fast inference in \(O(nh^2)\) time is employed where \(n\) is the number of parts and \(h\) the number of candidates per part. Typically, there are around \(h=100\) candidate positions per part, and the full inference takes around 0.1 s per MR volume. The full detection process from input to output typically takes less than a minute.

Training. The edges for the box functions \(S\), \(T\), \(U\), and \(V\) are found as the minimum and maximum argument values of those functions on the training set (e.g. the minimum and maximum \(x\)-distance between L1 and L2 for \(T\), etc.).

Testing. At test time, the whole model is fitted to the retained states, with extra “hidden” states with outside-FOV position for each part, with a penalty value for the “hidden” state learnt at training time [24]. At each position in the scan, the highest-scoring detection across all slices is retained for graphical model fitting as explained in Sect. 2.1. All retained states can be in different slices.

3 Experiments

3.1 Data, Annotation and Evaluation

The dataset consists of 371 MRI T2-weighted lumbar scans, acquired under various protocols. The scans contain normal and various abnormal cases as illustrated in Fig. 2. The dataset is split into 80 training and 291 testing images. The scans have isotropic in-slice resolution varying from 0.34 to 1.64 mm with mean at 0.78, median at 0.84 mm; and varying slice spacing from 3 to 5 mm, with 4 mm in almost all scans. The scans range in fields of view, containing 7–23 vertebrae starting from the sacrum, with median at 10 per scan.

Fig. 5
figure 5

The ground truth annotation process. A1-A3 show the generic vertebral, and B1-B3 the sacrum annotation process. There are two types of annotation: single point (the green\(+\)” in the figure—used for testing) and bounding box (the red rectangle—used for training). Given an input (A1, B1), the points (“\(+\)” and “\(\times \)”) are hand-placed (A2,B2). The bounding box annotation is found as the minimal bounding rectangle to the “\(\times \)” points around the vertebra/sacrum boundary. There are four boundary points for vertebrae (a) and eight for the sacrum (b)

Fig. 6
figure 6

Example results. Input and output are shown for six different scans a–f. The thick solid line rectangles show the detections for each vertebrae, along with their anatomical labels. Note how the algorithm is robust to varying field of view, resolution, and anatomy. Note that for visualization purposes, only mid-sagittal slices are shown and all the bounding boxes projected on them, however all the slices are searched for vertebrae candidates and the highest scoring ones retained

Annotation. The scans were hand-annotated with two types of ground truth as illustrated in Fig. 5: (i) All the vertebrae centres in all the scans are marked with a point (“\(+\)” in Fig. 5), and labelled with the vertebrae name; and (ii) all the training scans plus some test scans are annotated with a tight bounding box around each vertebra (Fig. 5A3, B3). The tight bounding boxes were defined by points (“\(\times \)” in Fig. 5) along the vertebrae boundaries as shown.

Fig. 7
figure 7

Localization error by vertebrae type. Boxplots representing detection errors are shown. The error for a given vertebra type is calculated as the distance between the center of the detected bounding box and the ground truth vertebra center, divided by the mean width of that vertebra. The mean vertebrae widths are evaluated based on the bounding boxes on the training set. The horizontal line in the middle of each box is the median error, and the bottom and top of each box are the 25 and 75 percentile errors respectively. The bottom and top error bar end are the 5 and 95 percentile errors respectively, and the ‘\(+\)’ denote statistical outliers

Evaluation protocol. The detections are evaluated against vertebrae-center and the sacrum-center ground truth points. A positive detection for the sacrum is counted if a detected sacrum bounding box contains the sacrum ground truth point and does not contain any vertebrae center ground truth points. A positive detection for the vertebrae is counted if a detected vertebra bounding box contains one and only one ground truth point for a vertebral body, including the sacrum. This evaluation protocol ensures that the cases where a detection is much larger than the vertebral body, covering several vertebrae, are not counted as positive.

3.2 Results

The algorithm is evaluated on a set of 291 lumbar spine test images with variable number of vertebrae visible. Given an input scan, both the sacrum and vertebrae detectors are run on the scan, searching over position, scale, and angle. The position search is dense, the scale varies from 0.25 to 2 of the original image size (scale = 1) in small increments (so that there are 10 different scales per doubling of image size). The angle search runs from \(-20^\circ \) to \(20^\circ \) for vertebrae and \(-60^\circ \) to \(0^\circ \) for the sacrum in \(10^\circ \) increments.

Example outputs are shown in Fig. 6, and statistical results on localization error over the test set are plotted in Fig. 7 and tabulated in Fig. 8 by vertebrae type.

Fig. 8
figure 8

Localization errors. The mean and standard deviation (std) of localization errors are shown for all the correctly detected and labelled vertebrae (identication rate \(84\,\%\) overall and \(87\,\%\) for lumbar). In adittion the “count”—the number of vertebrae detected of each type—is provided, along with the mean width of each of the vertebrae in training set. By allowing the labelling to be correct to \(+\)/\(-1\) vertebrae, the identication rates become 93 and \(95\,\%\) for all and lumbar vertebrae respectively

Fig. 9
figure 9

Detection on CT images with detectors trained on MR. Detectors trained on MR images can also successfully localize vertebrae in CT scans, indicating the robustness of the method to varying image appearance

We achieve 84.1 % correct identification rate overall, and 86.9 % for the lumbar vertebrae. The mean detection error between the Ground Truth centre of the vertebrae and the center of the detected bounding box is 3.3 mm, with standard deviation 3.2 mm. If the assigned labels are allowed to be shifted by \(+\)/\(-\) one, the errors are 92.9 and 94.7 % respectively.

Independent sacrum detection (without graphical model) with local non-maxima supression shows 98.1 % recall at 48 % precision. Independent general vertebrae detection (without graphical model) shows 97.1 % recall at 9.1 % precision.

Our method works well on very challenging examples with various anomalies illustrated earlier in Fig. 2. The identification results compare favourably to other approaches in the literature, although direct comparison is not possible since the algorithms have been evaluated on different datasets. Glocker et al. [18] report median identification error of 81 % with median localization error below 6mm on CT images. Zhan et al. [15] detect disks and vertebrae in isotropic MRI scans with 97.7 % “perfect” labelling rate as assessed by a medic but do not report detection errors. Pekar et al. [19] report 83 % correct labelling rate on 30 lumbar MRI scans. Our method is able to correctly localize the centre of the vertebrae out of the mid-sagittal slice in scoliotic cases such as image f in Fig. 6.

Application to CT images. Although the method has been principally designed for MRI images, it is directly applicable to CT images as shown in Fig. 9. No retraining is required for detection on CT due to the high generalization of HOG detectors.

4 Conclusion

We have presented a HOG-based algorithm to localize vertebrae in lumbar MRI scans of the spine that is simple, accurate and efficient. We demonstrate robustness to severe deformations due to diseases, image artefacts, and a wide range of resolution, patient position, and acquisition protocols on a challenging clinical dataset. It is straightforward to extend the method to completely general FOVs if required, by taking other anatomical context into account [18].