
1 Introduction

The human vertebral column consists of 33 individual vertebrae stacked on top of each other and connected through ligaments and intervertebral discs (IVDs). The vertebral column is divided into cervical, thoracic, lumbar, sacral and coccygeal regions [3]. Each of these regions performs vital functions in the human body, including absorbing shock, bearing load, protecting the spinal cord, and controlling load transfer through the vertebral column [1]. More precisely, the IVDs act as cushions of fibrocartilage and as the principal joints between vertebrae: they absorb the stress and shock the body sustains during motion and allow the spine to be flexible while preventing the vertebrae from grinding against one another. Disruption of any intervertebral disc through aging, degeneration, or injury alters the properties of the affected disc and impairs the mechanical function of adjacent tissues [19]. Consequently, localization and segmentation of intervertebral discs is a crucial task for spine disease diagnosis and provides useful information about the quality of the treatment procedure. To this end, various semi-automated and automated techniques have been proposed in the literature. These methods fall into two categories: hand-crafted methods and deep learning-based approaches. As an example of a hand-crafted method, Cheng et al. [5] proposed a two-step approach in which they first localize the center of each IVD by adapting a data-driven estimation framework [6] and then segment the IVDs by classifying image pixels around each disc center as either foreground (disc) or background. Glocker et al. [11] utilized a regression forest and a probabilistic graphical model to detect and localize intervertebral discs in CT scans. A polynomial iterative randomized Hough transform approach to segment the spine and intervertebral discs was proposed in [4].
Despite the good performance of these traditional methods, in some cases they perform worse than deep learning-based methods [2, 5]. Recent advances in deep learning have facilitated the investigation of robust intervertebral disc labeling [7, 8, 20]. In [12] the authors proposed to use a standard CNN for IVD segmentation. Dolz et al. [10] proposed a U-Net-like architecture called 'IVD-Net' that leverages information from multiple image modalities for intervertebral disc segmentation. In a recent article, Vania et al. [20] developed a method that builds upon Mask R-CNN and formulates a multi-optimization training system at a different stage to increase computational efficiency. In another approach [21], a cross-modality method for detecting both vertebrae and intervertebral discs on volumetric data has been proposed. This approach utilizes a local entropy-based texture model to localize the sacral region. Then, using three-disc entropy models, the detected positions are aligned and further refined by taking into account the intensity match between regions and a spinal column template. A transfer learning-based approach is utilized in [14]. In this work, a 2D convolutional structure is exploited to detect the lumbar discs in axial images. The proposed network combines the U-Net structure with a VGG backbone to produce a spine segmentation mask. The segmented regions are then used to calculate the herniation in lumbar discs. The authors of [17] combine a fully convolutional network with inception modules to localize and label intervertebral discs. Azad et al. [3] reformulated semantic vertebral disc labeling as a pose estimation problem and utilized an hourglass neural network to semantically label the intervertebral discs.

The main limitation of the reviewed methods is their dependency on the regular CNN learning strategy (learning texture, shape, color), which is not optimal for labelling anatomical structures such as intervertebral discs and usually produces both false positive (FP) and false negative (FN) detections [13]. To overcome this issue, we propose to incorporate shape information within the learning process. This additional signal guides the model to selectively emphasize the contextual representation, magnify the structural regions and suppress the less discriminative features (e.g. color, texture).

Moreover, a principal limitation of many existing studies is that they rely on a local-maximum technique to locate the vertebral discs in 2D space on top of the prediction masks, which leads to a substantial false positive rate. Exhaustive search tree [3], template matching [18] and point coordinate conditions [17] are among the popular algorithms proposed to reduce the FP rate. However, these approaches usually lack computational efficiency and recover the true candidates poorly. Therefore, a general method is required to handle this challenge. In this work, we propose to mitigate this limitation by strengthening the post-processing step of the intervertebral disc labeling procedure. Inspired by the idea of YOLO [16], we propose a permutation-invariant “look once” model to increase the true positive (TP) rate while reducing FN detections. We re-formulate the problem using a modified version of the PointNet model [15], which is invariant to certain geometric transformations (e.g. rotation). To the best of our knowledge, this is the first post-processing algorithm that processes the whole prediction in one step without any iteration (“look once”). Our contributions are as follows:

  • Adapting U-Net structure for semantic intervertebral disc labeling;

  • Incorporation of shape information to further boost model performance;

  • A permutation-invariant post-processing approach to reduce the FP rate;

  • Publicly available implementation source code (once accepted).

2 Proposed Method

Our proposed method consists of two stages. In the first stage we utilize a U-Net-based structure to detect and predict semantic labeling for each intervertebral disc location. In the second stage, we propose a deep permutation invariant “look once” model to refine the prediction results and eliminate the FP candidates. In the next subsections, we will discuss each phase in more detail.

Fig. 1. Proposed method for intervertebral disc labeling incorporating shape information.

2.1 Semantic Intervertebral Disc Labeling

The concept of the proposed method is depicted in Fig. 1. In our novel design, we incorporate the shape information (gradient of the input image) as an additional signal to encourage the model to learn contextual and geometric information. To this end, we form a pyramid representation using the multi-level description resulting from each block of the encoder module (U-Net encoder E parametrized with \(\theta \)): \(P = \{f_j = E(x, \theta ), j=0,1, \ldots , L\}\), where L is the number of pyramid levels. Next, we propose a shape attention module. Our attention module (Fig. 2) uses the global representation of each feature map alongside the shape description to selectively emphasize the contextual representation and suppress the less discriminative features. To this end, for each level of the pyramid, we learn the channel-wise recalibration parameters (\(w_j^f\)) and the spatial recalibration parameters (\(w_{sp}\)) from the shape feature description (sf):

$$\begin{aligned} w_{j}^{f}=\sigma \left( \mathbf{W}_{2}\, \delta \left( \mathbf{W}_{1}\, \mathrm{GAP}_{j}^{f}\right) \right) , \quad w_{sp}=\sigma \left( \mathbf{W}_{4}\, \delta \left( \mathbf{W}_{3}\, \mathrm{GAP}(sf)\right) \right) \end{aligned}$$
(1)

where \(\mathbf{W}_{k}, k \in \{1, 2,3,4\}\), are the learnable parameters applied to the global representation (GAP) of each pyramid level, and \(\delta \) and \(\sigma \) stand for the ReLU and sigmoid activations. We form the re-calibrated description by scaling both the channel and spatial dimensions: \( \tilde{P}_{j}^{f}=w_{sp} \cdot (w_{j}^{f} \cdot P_{j}^{f} )+ sf\). Once the re-calibration is performed, we aggregate the multi-level features in a nonlinear fashion (aggregation parameter \(w_{prm}\)) to produce a shape-attenuating description:

$$\begin{aligned} f'=\sigma \left( \sum _{j=1}^{L} w_{prm}^j \tilde{ P}_{j}^{f}\right) \end{aligned}$$
(2)

Fig. 2. Detailed structure of the proposed shape attention mechanism.
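
As a rough illustration, the recalibration in Eqs. (1) and (2) might be sketched as follows. This is a minimal, dependency-free sketch and not the reference implementation. It rests on several assumptions that the text does not state: each pyramid level and the shape feature sf are taken to have the same number of channels and pixels (i.e. already projected and resized to a common shape), both gates are applied per channel exactly as Eq. (1) is written, and all names (`gate`, `recalibrate`, `aggregate`, the bottleneck size `HIDDEN`, the random weights) are hypothetical.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gap(fmap):
    """Global average pooling: one mean value per channel."""
    return [sum(channel) / len(channel) for channel in fmap]


def gate(W_in, W_out, fmap):
    """sigma(W_out * relu(W_in * GAP(fmap))), the gating form used in Eq. (1)."""
    return sigmoid(matvec(W_out, relu(matvec(W_in, gap(fmap)))))


def recalibrate(P_j, sf, W1, W2, W3, W4):
    """P~_j = w_sp * (w_f * P_j) + sf, applied channel by channel."""
    w_f = gate(W1, W2, P_j)   # gate computed from the pyramid level
    w_sp = gate(W3, W4, sf)   # gate computed from the shape feature
    return [[w_sp[c] * w_f[c] * p + s for p, s in zip(P_j[c], sf[c])]
            for c in range(len(P_j))]


def aggregate(levels, w_prm):
    """f' = sigma(sum_j w_prm[j] * P~_j), as in Eq. (2)."""
    n_channels, n_pixels = len(levels[0]), len(levels[0][0])
    return [sigmoid([sum(w * lvl[c][i] for w, lvl in zip(w_prm, levels))
                     for i in range(n_pixels)])
            for c in range(n_channels)]


# Toy example with random weights (assumed sizes, for illustration only).
rng = random.Random(0)
C, HIDDEN, PIXELS, LEVELS = 4, 2, 6, 3


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


W1, W3 = rand_matrix(HIDDEN, C), rand_matrix(HIDDEN, C)
W2, W4 = rand_matrix(C, HIDDEN), rand_matrix(C, HIDDEN)
sf = rand_matrix(C, PIXELS)
pyramid = [rand_matrix(C, PIXELS) for _ in range(LEVELS)]

recalibrated = [recalibrate(P, sf, W1, W2, W3, W4) for P in pyramid]
f_prime = aggregate(recalibrated, w_prm=[0.5, 0.3, 0.2])
```

In a real network the weights would be learned and the feature maps would be tensors; the nested lists here only make the order of operations explicit.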

Subsequently, the same decoder as in the regular U-Net, but with \(V=11\) output channels (we assume that the input image comprises at most 11 intervertebral discs, according to [9]), is utilized to estimate the location of each intervertebral disc. Similarly, our ground truth mask consists of V channels, where in each channel the location of one intervertebral disc is labelled with a Gaussian kernel of radius 10. We employ the mean squared error (MSE) loss to train the network.
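
The target construction described above could look roughly like the sketch below. It is an illustration under assumptions, not the training code: the text only gives the kernel radius (10), so the standard deviation (`radius / 3`), the truncation at the radius, the (row, column) coordinate convention, and the function names are all hypothetical choices.

```python
import math


def gaussian_target(height, width, center, radius=10):
    """Single-channel heatmap with a Gaussian blob at `center` = (row, col)."""
    sigma = radius / 3.0  # assumption: the radius covers about three sigmas
    cy, cx = center
    target = [[0.0] * width for _ in range(height)]
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            d2 = (y - cy) ** 2 + (x - cx) ** 2
            if d2 <= radius ** 2:  # truncate the kernel at the given radius
                target[y][x] = math.exp(-d2 / (2 * sigma ** 2))
    return target


def build_mask(height, width, disc_centers, num_channels=11):
    """One channel per disc label; channels of absent discs stay all-zero."""
    mask = [[[0.0] * width for _ in range(height)] for _ in range(num_channels)]
    for label, center in disc_centers.items():
        mask[label] = gaussian_target(height, width, center)
    return mask


def mse(pred, target):
    """Mean squared error over all channels and pixels."""
    sq = [(p - t) ** 2
          for pc, tc in zip(pred, target)
          for pr, tr in zip(pc, tc)
          for p, t in zip(pr, tr)]
    return sum(sq) / len(sq)


# Toy 64 x 32 image with two annotated discs (labels 0 and 1).
mask = build_mask(64, 32, {0: (10, 16), 1: (30, 15)})
zeros = [[[0.0] * 32 for _ in range(64)] for _ in range(11)]
loss_of_empty_prediction = mse(zeros, mask)
```

Keeping one channel per disc means the channel index itself carries the semantic label, so no separate classification head is needed for labeling.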

2.2 Refinement Network

Detecting intervertebral disc locations often yields FP and FN predictions. Several post-processing approaches have been proposed in the literature to overcome this problem. Rouhier et al. [17] deploy a condition-based strategy to eliminate the FP candidates generated by their countception method. In a recent article, Azad et al. [3] argue that the condition-based strategy usually fails to recover the TP candidates among the detected regions and propose a tree-based decision space. Their approach creates a search tree in which each path represents one possible combination of ordered intervertebral disc locations. They then calculate an error function between the general skeleton and the predicted skeleton. This iterative algorithm performs an exhaustive search and is not efficient when the number of FPs is high. Template matching [18] is another approach that seeks to reduce the FP rate by considering predefined patterns.

These methods all share the assumption of particular conditions or predefined patterns. In addition, some of these methods perform the selection in an iterative fashion, which may not be feasible when the number of FPs is high. To mitigate these issues, we propose a method that 'looks only once' at the noisy prediction to recover the intervertebral disc locations. To this end, we assume that, for the input image I with N intervertebral disc locations, the detection model predicts a set of M intervertebral disc candidates, where usually \(M \ge N\) and each candidate is a 2D position in \(\mathbb{R}^2\). For generality, we further assume that the prediction model is not able to provide any semantic labelling. Thus, the objective is to recover the N points out of M that best match the ground truth intervertebral disc locations. Since no semantic information is provided for the predicted points, we treat them as a set of M intervertebral disc candidates. The set is made up of unstructured data, and selecting N intervertebral disc locations out of M candidates requires examining the following number of combinations:

$$\begin{aligned} \frac{M!}{N!\, (M-N)!} \end{aligned}$$
(3)

The processing time increases dramatically if \(M \gg N\). To overcome this limitation, it is highly desirable that the post-processing algorithm processes the whole prediction at once without any iterations (“look once”). Therefore, the deep model needs to be permutation invariant, i.e., any order of the points should produce the same result. The proposed structure is depicted in Fig. 3. The proposed method consists of two data streams. In the first stream (top), a series of feature transformation layers, followed by a multilayer perceptron (MLP), is designed to encode the input coordinates into a high-level representational space. The objective of this representation is to create a discriminative embedding space that characterizes each point by a hidden dependency underlying the input data. Intrinsically, the transformation layer in this stream makes the representation robust to noisy samples and less sensitive to affine geometric transformations (e.g. rotation). To preserve permutation invariance, the MLP layer deploys a shared kernel to produce a set of representations that are independent of the order of the points. Finally, in addition to the generated feature map, a symmetric function (global pooling) is utilized to capture the shared signature among all points. We concatenate the global information with the local representation of each point to describe each intervertebral disc candidate. Details of the network structure are given in Table 1. This representation captures the general structure of the data; however, it still lacks pair-wise relational information. To include such information, we create a geometrical representation. To this end, using fully connected layers, we learn the embedding parameters to model the long-range geometrical dependency.
The main objective of this layer is to capture the geometrical relation between points and feed it to the scaling function. We apply the sigmoid function on top of the generated representation to form an attention vector. This attention vector performs the re-calibration process and adaptively scales the generated feature map. The final representation is then fed to a single-layer perceptron with a softmax output to classify each candidate.
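
To give a feeling for how quickly the count in Eq. (3) grows, the short sketch below evaluates it for a few candidate set sizes. The values of M are chosen for illustration only; M = 31 corresponds to the hypothetical case of 11 true discs plus 20 added FP candidates.

```python
import math


def num_candidate_subsets(m, n):
    """Number of ways to choose n disc locations out of m candidates, Eq. (3)."""
    return math.comb(m, n)


N = 11  # maximum number of intervertebral discs assumed per image
counts = {M: num_candidate_subsets(M, N) for M in (11, 15, 20, 31)}
for M, count in counts.items():
    print(f"M = {M:2d}: {count} subsets")
```

Even a handful of spurious candidates pushes the number of subsets into the thousands, which is the motivation for a single-pass model over the whole candidate set.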

Fig. 3. Proposed structure for the post-processing step. The noisy prediction without a semantic label passes through the model to eliminate the FP candidates.

Table 1. Details on network architecture for the post-processing stage. We follow [15] for the structure of the Feature Transform module (including T-Net) which simply aligns the input to a feature space using an affine transformation without changing the dimension. We refer the reader to [15] for more general expositions. Note that n denotes the number of vertebral discs detected.
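
The core idea of the first stream (a shared per-point MLP, a symmetric pooling function, and per-point classification on the concatenated local and global features) can be sketched as below. This is a simplified illustration, not the architecture of Table 1: it omits the Feature Transform (T-Net) and the geometrical attention stream, uses a single hidden layer with random, untrained weights, and all names and sizes (`classify_candidates`, `EMBED`) are hypothetical.

```python
import math
import random


def linear(W, b, v):
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-1, 1) for _ in range(n_out)]
    return W, b


def classify_candidates(points, shared, head):
    """Score all candidates in one pass, regardless of their input order."""
    # 1) Shared MLP: the same weights embed every (x, y) candidate separately.
    local = [relu(linear(*shared, p)) for p in points]
    # 2) Symmetric function: channel-wise max over all points (global signature).
    global_feat = [max(column) for column in zip(*local)]
    # 3) Concatenate local and global features and classify each point
    #    into (false positive, true positive) probabilities.
    return [softmax(linear(*head, f + global_feat)) for f in local]


rng = random.Random(0)
EMBED = 8
shared = init_layer(rng, EMBED, 2)
head = init_layer(rng, 2, 2 * EMBED)

candidates = [(0.1, 0.2), (0.4, 0.9), (0.7, 0.3), (0.5, 0.5)]
scores = classify_candidates(candidates, shared, head)

# Reordering the input only reorders the per-point outputs.
scores_reversed = classify_candidates(candidates[::-1], shared, head)
```

Because the pooled feature is a maximum over points and the remaining layers act on each point separately, reordering the candidates cannot change the score assigned to any individual candidate.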

3 Experimental Results

In this section, we first describe the dataset and metrics used throughout our experimental evaluation. Then, we analyze the experimental results in detail. Our analysis was based on the publicly available Spine Generic Dataset [9]. The dataset was acquired across 42 centers worldwide (with a total of 260 participants) and provides both T1 and T2 MRI contrasts for each subject. The images were obtained from diverse institutes and vary considerably in image quality, subject age and imaging device, which makes the dataset a challenging benchmark for the task of intervertebral disc labelling.

3.1 Metrics

To ensure a valid comparison of results and to draw conclusions on the applicability of our approach, we consider several metrics. First, we use the L2 norm, i.e. the distance between each predicted intervertebral disc location and the ground truth along the superior-inferior axis, to quantify the accuracy of our proposal. To assess our post-processing approach, we additionally report the False Positive Rate (FPR) and the False Negative Rate (FNR). Similar to [3], the FPR counts the predictions that are at least 5 mm away from the ground truth positions. Likewise, the FNR counts the ground truth positions that are at least 5 mm away from the predicted intervertebral disc positions.
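
A minimal sketch of how these metrics might be computed for a single image is given below. Several details are assumptions rather than facts from the text: coordinates are taken to be in millimetres, plain Euclidean distance to the nearest point is used for matching, the counts are normalized by the number of predictions (FPR) and of ground truth positions (FNR), and both point sets are assumed to be non-empty. The function names are hypothetical.

```python
import math


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest point in `others`."""
    return min(math.dist(point, other) for other in others)


def detection_metrics(preds, gts, threshold_mm=5.0):
    """Return (mean distance to target, FPR, FNR) for one image."""
    pred_dists = [nearest_distance(p, gts) for p in preds]
    gt_dists = [nearest_distance(g, preds) for g in gts]
    # False positives: predictions at least `threshold_mm` from every target.
    fp = sum(d >= threshold_mm for d in pred_dists)
    # False negatives: targets at least `threshold_mm` from every prediction.
    fn = sum(d >= threshold_mm for d in gt_dists)
    # Distance to target, averaged over the targets that were detected.
    matched = [d for d in gt_dists if d < threshold_mm]
    dtt = sum(matched) / len(matched) if matched else float("nan")
    return dtt, fp / len(preds), fn / len(gts)


gts = [(0.0, 0.0), (0.0, 30.0), (0.0, 60.0)]
preds = [(0.0, 3.0), (4.0, 30.0), (20.0, 45.0)]
dtt, fpr, fnr = detection_metrics(preds, gts)
```

In this toy case two discs are found within 3 mm and 4 mm of their targets, while the third prediction lies 25 mm from the nearest target and therefore counts as one FP and leaves one FN.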

3.2 Comparison of Results

Table 2. Intervertebral disc labeling results on the spine generic public dataset. Note that DTT indicates distance to target.

Fig. 4. (a) Intervertebral labeling results of three representative T2 images. Upper row: ground truth; lower row: predictions. (b) Before (left) and after (right) applying the look-once approach to the noisy prediction generated on T1.

We train all of our models using the Adam optimizer with momentum for 100 epochs with a batch size of 2. In our experiments, we use an initial learning rate of 0.0001, decayed by a factor of 0.5 every 20 epochs. We use the same setting as explained in [17] to enable a fair comparison of our method with the literature, and we report our findings in Table 2. Note that our baseline model uses the same structure as presented but without the proposed modules. The results show that our approach achieves competitive results on both the T1 and T2 contrasts. Specifically, our proposed method shows superior performance on the T2 contrast, where it outperforms all other approaches in terms of FNR and distance to the target. Compared to the pose estimation approach [3], our method produces a lower average distance to the intervertebral locations on the T1 modality, but there is only a small gap in distance variance. We also observe that removing the proposed modules slightly decreases the performance of the model, which highlights the importance of shape information in intervertebral disc labeling. Moreover, unlike the countception and template matching approaches, our method does not require a heavy preprocessing step for spinal cord region detection and outperforms these methods in both quantitative performance and inference time. In contrast to our proposal, the inference time of the two aforementioned approaches grows exponentially as the FP rate increases (see Table 3). In Fig. 4(a) we provide sample results of the proposed model on the T2 modality. It can be observed that the method precisely provides a semantic label for each IVD location without any FP predictions. It should be noted that, in contrast to the SOTA approaches, our method requires less processing time even with a large number of FP detections (illustrated in Fig. 5).
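
The learning-rate schedule described above can be written as a simple step function. The sketch below assumes zero-indexed epochs and that the decay is applied at the start of epochs 20, 40, 60 and 80; the function name is hypothetical, and in practice a framework scheduler would typically be used instead.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.5, step=20):
    """Step decay: multiply the base rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Learning rate for each of the 100 training epochs (zero-indexed).
schedule = [learning_rate(epoch) for epoch in range(100)]
```

Under these assumptions the rate is halved four times over the 100 epochs, ending at one sixteenth of its initial value.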

Table 3. Performance comparison of the proposed post-processing approach vs the SOTA approach for eliminating FP detections. The experiment was done on 100 images, where 20 random FP detections were added to each image.

Fig. 5. Inference time of the proposed method vs the search-tree-based approach [3]. Our method looks only once at the prediction to eliminate the FP samples, while the search-based approach uses an iterative algorithm.

3.3 Evaluation on the Noisy Prediction

To further analyze the robustness of the proposed method in the presence of noisy predictions, we evaluate the proposed “look once” post-processing method separately. To this end, we sample new points from a 2D Gaussian distribution around each intervertebral disc. A sample noisy image along with the model prediction is depicted in Fig. 4(b). As shown, the proposed method retrieves the IVD locations from the noisy prediction well (and very quickly) without relying on any predefined assumption. In addition, in our experiment (supplementary file), we observe that for the search-tree-based approach the post-processing time increases exponentially with the FP rate. Similarly, the template matching method failed to recover the TP candidates in most cases. In contrast, our method recovered the TP samples with high precision without any iteration. Moreover, Fig. 6 shows additional sample detections of our method, which illustrate the visual quality of its predictions.
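
The noisy candidate sets used in this evaluation could be generated along the lines of the following sketch. The number of extra points per disc, the standard deviation, and the function name are illustrative assumptions, since the text does not specify them.

```python
import random


def add_noisy_candidates(discs, per_disc=2, sigma=8.0, seed=0):
    """Sample extra candidates from a 2D Gaussian centred on each disc.

    Returns the true disc locations mixed with the sampled points, in
    random order, so that the order carries no semantic information.
    """
    rng = random.Random(seed)
    noisy = [(rng.gauss(x, sigma), rng.gauss(y, sigma))
             for x, y in discs
             for _ in range(per_disc)]
    candidates = list(discs) + noisy
    rng.shuffle(candidates)
    return candidates


discs = [(50.0, 20.0), (52.0, 45.0), (55.0, 70.0)]
candidates = add_noisy_candidates(discs)
```

Shuffling the result mimics a detector that returns unlabeled, unordered candidates, which is the setting the refinement network is designed for.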

Fig. 6. More results of the proposed method for intervertebral disc labeling on T1w images. The first row shows the ground truth, while the second row shows the predicted intervertebral discs along with the semantic labeling (color).

4 Conclusion

In this paper, we systematically formulate the intervertebral disc labelling problem by designing a novel method to incorporate shape information. The proposed method encourages the model to focus on learning contextual and geometrical features. Additionally, we propose a “look once” post-processing approach. With it, our model reduces false-positive detections while substantially shortening the post-processing time. The results presented in this paper demonstrate the potential of our methodology relative to the competing methods.