1 Introduction

Fig. 1

(a) An anterior-posterior X-ray image showing how the Cobb angle is measured. Green points are the four landmarks of every vertebra. Blue points are the landmarks of the two most tilted vertebrae. (b) Corresponding manually generated binary mask

Fig. 2

An illustration of the LEN. The gray part is the segmentation network based on an FCN. The blue part is the network for landmark estimation, which is built from fully convolutional layers. Given a 3-channel image as input, the LEN outputs a binary mask heatmap and 136 estimated coordinates of 68 landmarks

Fig. 3

The networks compared in the ablation experiments, which combine features at different levels. The models differ only in their concatenation operations

Adolescent idiopathic scoliosis (AIS) is defined as a spinal torsional deformity combined with different degrees of rotational spinal deformity [1]. According to the current literature, 0.47–5.2% [2] of children have some degree of scoliosis.

The Cobb method [3] is considered a classical and efficient way to quantify scoliosis by the Cobb angle [4] in both the coronal and the sagittal planes. The Cobb angle is the angle between the two most tilted vertebrae, specifically between the upper endplate of the uppermost vertebra and the lower endplate of the lowest vertebra, as shown in Fig. 1. However, manual Cobb angle measurement is time-consuming on low-contrast X-rays because clinicians must locate four landmarks on every vertebra and compare their slopes to measure the Cobb angle.

In this paper, we propose an automated architecture for the Cobb angle measurement. The architecture uses both segmentation and landmarks of vertebrae to supervise estimated landmarks. In addition, the architecture considers spinal curvature as a constraint to estimate the Cobb angle.

2 Related work

Recent studies based on deep learning have proposed some effective methods for the Cobb angle measurement on X-rays. These methods can be divided into two categories: (1) direct landmark estimation methods, and (2) indirect segmentation methods.

The direct landmark estimation methods aim to locate the landmarks of interest directly on X-rays, much like the manual process. For example, landmarks are predicted with a structured multi-output regression network in [5]. Boost-net, which finds landmarks by transforming the feature space, is proposed in [6]. A series of methods (MVC-Net, MVE-Net) use multi-view (anterior-posterior and lateral) X-rays and jointly exploit the features of both views [7, 8]. AEC-Net uses both Cobb angles calculated by rules and directly estimated angles to correct the Cobb angle error [9].

The indirect segmentation methods aim to segment the vertebrae of interest on X-rays and then measure the Cobb angle based on the segmentation. For example, in [10], an automated model is proposed for spine segmentation, and a polynomial is used to fit the spinal curvature. Owing to the high performance of U-Net [11] in the medical field, numerous studies build on U-Net, such as dense U-Net [12], residual U-Net [13] and shape-aware U-Net [14]. In [15], an automatic DU-Net is proposed to segment the spine, and a sixth-order polynomial is used to characterize the spinal curvature. In [16], MBR-Net, which is based on U-Net, segments the images, and a minimum bounding rectangle is used to represent each vertebra. The Mask RCNN [17] is used to segment vertebrae, and the centers of the segmented regions are used to calculate the Cobb angle [18]. The U-Net is used to segment lumbar vertebrae and estimate the lumbar lordosis angle on lateral X-rays in [19]. In [20], the Mask RCNN is used to segment vertebrae, and a small network is used to estimate landmarks on anterior-posterior X-rays.

Both categories of methods achieve high performance for the Cobb angle measurement. However, they use the segmentation and landmark information separately to supervise networks. In this light, we propose an automated architecture that uses the segmentation and landmark information in combination. It takes the segmentation as an auxiliary task to estimate landmarks and uses spinal curvature to estimate the Cobb angle on anterior-posterior X-rays. The experimental results show that our method achieves smaller errors on landmark and Cobb angle estimation.

3 Methods

3.1 Overview

In this study, we propose a landmark and Cobb angle estimation network (LCE-Net). The architecture consists of two parts: (1) a landmark estimation network (LEN), and (2) a Cobb angle estimation network (CEN). The LEN first estimates 68 landmarks of 17 vertebrae (12 thoracic vertebrae and 5 lumbar vertebrae) by taking segmentation as an auxiliary task. Then, the CEN estimates the Cobb angle from the spinal curvature described by the 68 landmarks, treating the curvature as a constraint.

3.2 Landmark estimation network

We assume that the locations of the landmarks and the segmentation have a potential relationship in a physical space. The LEN combines the features of two kinds of networks: (1) the network for segmentation (NFS) and (2) the network for landmark estimation (NFL). Because the FCN [21] achieves high performance for image segmentation, the LEN takes the FCN as the NFS and a simple network composed of fully convolutional layers as the NFL. The LEN combines the segmentation and landmark information by concatenating features of the NFS and NFL.

The architecture of the LEN is shown in Fig. 2. Given an X-ray as input, the LEN outputs a pixel-wise segmentation heatmap and landmark coordinates:

\(I \rightarrow h,c \)

where I denotes an image, h denotes the estimated heatmap and c denotes the estimated landmark coordinates, formatted as \({\mathrm{EC}} = [x_1^e,y_1^e,\ldots ,x_{68}^e,y_{68}^e]\). The landmarks are ordered from top to bottom and from left to right. We scale both the estimated landmarks and the ground-truth landmarks to the range between 0 and 1. Estimated coordinates are normalized by a sigmoid function: \({\mathrm{EC}} = \frac{1}{1+e^{-x}} \), where x is the raw network output. Ground-truth coordinates are normalized by \( {\mathrm{GC}} = [x_1^g/w,y_1^g/h,\ldots ,x_{68}^g/w,y_{68}^g/h] \), where w and h here are the width and height of the image.
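The two normalizations can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the example assumes an image 256 pixels wide and 512 pixels high, since the text does not state which dimension of the \(512 \times 256\) resolution is the width.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + e^-x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def normalize_estimates(raw_outputs):
    """Map raw network outputs to EC, with every value in (0, 1)."""
    return [sigmoid(v) for v in raw_outputs]


def normalize_ground_truth(points, width, height):
    """Map pixel landmarks [(x, y), ...] to GC = [x1/w, y1/h, ...]."""
    flat = []
    for x, y in points:
        flat.extend([x / width, y / height])
    return flat


def to_pixels(normalized, width, height):
    """Invert the scaling: [x1, y1, ...] in [0, 1] back to pixel pairs."""
    return [
        (normalized[i] * width, normalized[i + 1] * height)
        for i in range(0, len(normalized), 2)
    ]


# Hypothetical example: one landmark at the image center.
gc = normalize_ground_truth([(128, 256)], width=256, height=512)
ec = normalize_estimates([0.0, 0.0])  # raw output 0 maps to 0.5
```

Because both vectors live in the same unit range, the estimated coordinates can be compared with the ground truth directly and scaled back to pixels with `to_pixels` when needed.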

Two types of loss functions are used in the training stage: (1) a mean squared error loss is used for comparing estimated landmarks to ground-truth landmarks per image:

$$\begin{aligned} {\mathrm{Lmse}} = \frac{1}{N} \sum \limits _{i=1}^N ({\mathrm{EC}}_i - {\mathrm{GC}}_i)^2 \end{aligned}$$
(1)

where N means the number of coordinates, 136 (68 x-coordinates and 68 y-coordinates) in our experiment. (2) A cross-entropy error loss is used for comparing estimated segmentation heatmaps to ground-truth heatmaps per image:

$$\begin{aligned} {\mathrm{Lcee}} = - \frac{1}{\mathrm{WH}} \sum \limits _{i=1}^{\mathrm{WH}} (y\log \hat{y}) \end{aligned}$$
(2)

where W and H denote the width and height of the image, y denotes the ground-truth label and \(\hat{y}\) denotes the estimated probability of every pixel. Ground-truth segmentation heatmaps are constructed by labeling vertebra pixels as 1 and background pixels as 0. The full training loss of the LEN is \({\mathrm{Loss}}={\mathrm{Lmse}} + \varphi \times {\mathrm{Lcee}} \), where \(\varphi \) is the weight that balances the segmentation and landmark estimation tasks, set to 0.05 in our experiment.
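A minimal sketch of how the two loss terms combine is given below, assuming flat Python lists in place of tensors. The function names and the `eps` clamp are illustrative assumptions. The cross-entropy term follows Eq. 2 literally, so background pixels (y = 0) contribute nothing; whether the actual implementation also sums over the background class is not stated in the text.

```python
import math


def mse_loss(ec, gc):
    """Eq. 1: mean squared error over the N normalized coordinates."""
    return sum((e - g) ** 2 for e, g in zip(ec, gc)) / len(ec)


def cross_entropy_loss(probs, labels, eps=1e-12):
    """Eq. 2 as written: -(1 / WH) * sum of y * log(y_hat) over all pixels.

    eps is an assumed clamp that keeps log() finite when y_hat is 0.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += y * math.log(max(p, eps))
    return -total / len(probs)


def len_loss(ec, gc, probs, labels, phi=0.05):
    """Full LEN loss: Lmse + phi * Lcee, with phi = 0.05 as in the text."""
    return mse_loss(ec, gc) + phi * cross_entropy_loss(probs, labels)


# Hypothetical toy values: two coordinates and a two-pixel mask.
loss = len_loss(ec=[0.5, 0.5], gc=[0.5, 1.0], probs=[0.5, 0.9], labels=[1, 0])
```

With \(\varphi = 0.05\), the segmentation term acts as a light auxiliary signal next to the landmark term.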

As ablation experiments, we design a series of networks that combine features of the NFS and the NFL at different levels. These architectures are shown in Fig. 3. All convolutions except the last layer of the LEN in our proposed model use a \(3\times 3\) convolution kernel with a stride of 1 (with padding = 1); the last convolution uses a \(4\times 2\) convolution kernel, followed by batch normalization (BN) [22], PReLU, and dropout with a probability of 25% [23].
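The effect of these kernel settings on feature-map size can be checked with the standard convolution output-size formula. The helper name below is hypothetical, and the \(4\times 2\) input size used for the last layer is an assumption made only for illustration, since the text does not state the feature-map size at that point.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Standard formula: floor((size + 2 * padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# A 3x3 kernel with stride 1 and padding 1 preserves the spatial size.
same_h = conv_output_size(512, kernel=3, stride=1, padding=1)
same_w = conv_output_size(256, kernel=3, stride=1, padding=1)

# Assumed 4x2 feature map: an unpadded 4x2 kernel would reduce it to 1x1.
last_h = conv_output_size(4, kernel=4)
last_w = conv_output_size(2, kernel=2)
```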

Fig. 4

An illustration of the CEN. Given an estimated landmark vector as input, the CEN outputs the Cobb angle scaled between 0 and 1

3.3 Cobb angle estimation network

We found that a small landmark error can cause a large Cobb angle error because the slopes of the lines drawn in Fig. 1 may change substantially. To address this issue, we assume that spinal curvature and the Cobb angle have a potential relationship. Unlike the manual measurement process, which compares the most oblique vertebrae, the CEN uses the spinal curvature described by the 68 estimated landmarks as a constraint to estimate the Cobb angle. The architecture is shown in Fig. 4. The CEN takes EC as input and outputs the estimated Cobb angle:

\( c \rightarrow a\)

where a means the estimated Cobb angle. We also scale estimated Cobb angles and ground-truth Cobb angles between 0 and 1. Estimated Cobb angles are normalized by a sigmoid function: \( {\mathrm{EA}} = \frac{1}{1+e^{-x}} \). Ground-truth Cobb angles are normalized by \({\mathrm{GA}} / 180^{\circ }\). A mean squared error loss is also used for comparing estimated angles to ground-truth angles:

$$\begin{aligned} {\mathrm{Lmse}} = \frac{1}{N} \sum \limits _{i=1}^N ({\mathrm{EA}}_i - {\mathrm{GA}}_i)^2 \end{aligned}$$
(3)

where N means the number of images.
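The angle scaling and the loss in Eq. 3 can be sketched as below. The function names are hypothetical and plain lists stand in for batches; this is an illustration under those assumptions, not the authors' code.

```python
def normalize_angle(angle_deg):
    """Scale a ground-truth Cobb angle in degrees to GA / 180."""
    return angle_deg / 180.0


def denormalize_angle(scaled):
    """Map a scaled estimate in (0, 1) back to degrees."""
    return scaled * 180.0


def angle_mse(ea, ga):
    """Eq. 3: mean squared error over N images of scaled angles."""
    return sum((e - g) ** 2 for e, g in zip(ea, ga)) / len(ea)


# Hypothetical example: a 45 degree ground truth and an estimate of 0.26.
ga = [normalize_angle(45.0)]       # 0.25
ea = [0.26]
loss = angle_mse(ea, ga)           # (0.26 - 0.25) ** 2
estimate_deg = denormalize_angle(ea[0])
```

An error of 0.01 in the scaled output corresponds to 1.8° after rescaling, which is why small differences in the normalized loss still matter clinically.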

4 Experiments

4.1 Dataset

Our dataset consists of 1200 spinal X-rays with an average pixel resolution of \(957 \times 491\) provided by a local hospital. Four landmarks and the segmentation mask of each vertebra were labeled by two professional clinicians, each with 8 years of experience. Each clinician labeled half of the images, and the two clinicians checked each other's labels. We scaled all images to a pixel resolution of \(512 \times 256\). The Cobb angles in our dataset range from \(1.56^{\circ }\) to \(91.74^{\circ }\).

4.2 Training details

The experiments were run on a PC with Ubuntu 14.04 and an NVIDIA GeForce GTX 1080Ti GPU. The architecture is implemented in Python with the PyTorch framework. The learning rates of the LEN and the CEN are both set to 0.001, and the momentum is set to 0.9 for stochastic gradient descent (SGD). In every training session, the 1200 X-rays are randomly divided into a training set, a validation set, and a test set in the proportion 6:2:2. The reported results are the average performance of 5-fold validation.
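A random 6:2:2 split of this kind can be sketched as follows. The function name, the `seed` parameter, and the use of Python's `random` module are assumptions for illustration; the text does not describe how the split is implemented.

```python
import random


def split_indices(n, ratios=(6, 2, 2), seed=None):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# For 1200 images, the split yields 720 / 240 / 240 images.
train, val, test = split_indices(1200, seed=0)
```

Using a different `seed` per session gives a fresh random partition each time, which matches the description of re-splitting in every training session.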

4.3 Performance metrics

For the landmark estimation, we use the landmark mean absolute error (LMAE) to calculate the error. The LMAE is defined as follows:

$$\begin{aligned} {\mathrm{LMAE}}=\frac{1}{M}\frac{1}{N}\sum _{j=1}^M \sum _{i=1}^N \vert {\mathrm{EC}}_{j,i}-{\mathrm{GC}}_{j,i} \vert \end{aligned}$$
(4)

where M is the number of images, N is the number of coordinates per image, 136 (68 x-coordinates and 68 y-coordinates) in our experiments, and the subscript j,i refers to the i-th coordinate of the j-th image.
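Eq. 4 amounts to averaging the absolute coordinate differences over all coordinates and all images. A minimal sketch with a hypothetical function name, assuming every image has the same number of coordinates:

```python
def lmae(estimated, ground_truth):
    """Eq. 4: mean absolute error over M images and N coordinates each.

    Both arguments are lists of M coordinate vectors of equal length N.
    """
    m = len(estimated)
    total = 0.0
    for ec, gc in zip(estimated, ground_truth):
        total += sum(abs(e - g) for e, g in zip(ec, gc)) / len(ec)
    return total / m


# Hypothetical toy data: 2 images with 2 normalized coordinates each.
error = lmae(
    estimated=[[0.1, 0.2], [0.3, 0.4]],
    ground_truth=[[0.1, 0.3], [0.5, 0.4]],
)
```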

For the Cobb angle estimation, we use the angle mean absolute error (AMAE) and symmetric mean absolute percentage error (SMAPE) to calculate the error:

$$\begin{aligned} {\mathrm{AMAE}}=\frac{1}{M}\sum _{i=1}^M \vert {\mathrm{Angle}}^{\mathrm{est}}_i - {\mathrm{Angle}}^{\mathrm{gt}}_i \vert \end{aligned}$$
(5)
$$\begin{aligned} {\mathrm{SMAPE}} =\frac{100\%}{M}\sum _{i=1}^M \frac{\vert {\mathrm{Angle}}^{\mathrm{est}}_i - {\mathrm{Angle}}^{\mathrm{gt}}_i \vert }{(\vert {\mathrm{Angle}}^{\mathrm{est}}_i \vert + \vert {\mathrm{Angle}}^{\mathrm{gt}}_i \vert )/2} \end{aligned}$$
(6)

where \({\mathrm{Angle}}^{\mathrm{est}}_i\) denotes the angle that is either estimated directly or calculated from the estimated landmarks, and \({\mathrm{Angle}}^{\mathrm{gt}}_i\) denotes the ground-truth angle. Angles are calculated from landmarks in the same way as in the manual process shown in Fig. 1:

$$\begin{aligned} {\mathrm{Angle}}=\vert \arctan {\frac{y_{2}^{\mathrm{up}}-y_{1}^{\mathrm{up}}}{x_{2}^{\mathrm{up}}-x_{1}^{\mathrm{up}}}} - \arctan {\frac{y_{2}^{\mathrm{low}}-y_{1}^{\mathrm{low}}}{x_{2}^{\mathrm{low}}-x_{1}^{\mathrm{low}}}}\vert \end{aligned}$$
(7)

where \(x^{\mathrm{up}}_i\) and \(y^{\mathrm{up}}_i\) denote the landmark coordinates on the upper endplate of the uppermost vertebra, and \(x^{\mathrm{low}}_i\) and \(y^{\mathrm{low}}_i\) denote the landmark coordinates on the lower endplate of the lowest vertebra. The upper and lower endplates are the edges of the two most tilted vertebrae, such as the two red lines shown in Fig. 1.
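The rule-based angle of Eq. 7 and the two angle metrics of Eqs. 5 and 6 can be sketched together as follows. The function names are hypothetical, angles are returned in degrees, and the sketch assumes that the two landmarks of an endplate never share the same x-coordinate, since Eq. 7 divides by their difference.

```python
import math


def slope_angle(p1, p2):
    """Angle in radians of the line through p1 and p2 (arctan of its slope)."""
    (x1, y1), (x2, y2) = p1, p2
    return math.atan((y2 - y1) / (x2 - x1))


def cobb_angle(upper, lower):
    """Eq. 7: absolute difference of the two endplate angles, in degrees.

    upper and lower are pairs of (x, y) landmarks on the two endplates.
    """
    return abs(math.degrees(slope_angle(*upper) - slope_angle(*lower)))


def amae(est, gt):
    """Eq. 5: mean absolute angle error."""
    return sum(abs(e - g) for e, g in zip(est, gt)) / len(est)


def smape(est, gt):
    """Eq. 6: symmetric mean absolute percentage error, in percent."""
    terms = [abs(e - g) / ((abs(e) + abs(g)) / 2) for e, g in zip(est, gt)]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical endplates tilted by +45 and -45 degrees give a 90 degree angle.
angle = cobb_angle(upper=((0, 0), (1, 1)), lower=((0, 0), (1, -1)))

# Sensitivity: on a 40-pixel-wide endplate, a 2-pixel vertical shift of one
# landmark changes the measured angle by roughly 2.9 degrees.
shift = cobb_angle(upper=((0, 0), (40, 2)), lower=((0, 0), (40, 0)))
```

The second example illustrates why a rule-based calculation that relies on only four pivot landmarks is sensitive to small localization errors.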

Table 1 Comparison with existing methods on X-rays
Table 2 Comparison with a series of networks which combine different level features
Fig. 5

Some visual results of the proposed method (d) and existing methods (a, b and c)

5 Results and discussion

5.1 Results

We compare our framework with other methods. We also compare the LEN with the NFL for landmark estimation. The results are shown in Table 1. According to the results, the LEN reduces the error of landmark estimation and the CEN reduces the Cobb angle error. As shown in Table 2, we also compare the landmark estimation performance of a series of networks that combine features at different levels, as shown in Fig. 3. The data in Tables 1 and 2 are calculated by Eqs. 4, 5 and 6.

As Table 1 shows, the LEN achieves a lower landmark estimation error because it uses more information. It uses the information of two tasks to supervise the landmark estimation, whereas existing methods use only a single source of information. The CEN achieves a smaller Cobb angle estimation error than the LEN because it considers spinal curvature as a constraint. It captures the relationship between the spinal curvature and the Cobb angle, which is more robust than the rule-based calculation.

As Table 2 shows for the ablation experiments, the LEN and models a to d have almost the same performance. They all achieve a lower error than the LEN without the segmentation branch because they share a similar multi-task network architecture. Moreover, the LEN, which combines features at the most levels, achieves higher performance than the others.

Figure 5 shows some visual results of the proposed method and existing methods.

5.2 Discussion

Existing methods directly estimate landmarks or segment vertebrae, and then use rules to calculate the Cobb angle, such as calculating the center points of vertebrae [18] or fitting lines to the bounding boxes of vertebrae [16]. This may lead to a large angle error even when the segmentation and landmark errors are small. The LCE-Net avoids this issue in two ways: (1) the LEN uses segmentation as an auxiliary task, which provides more information for estimating landmarks, and (2) the Cobb angle is estimated from the spinal curvature instead of being calculated by Eq. 7. Therefore, this method is more robust than the rules when some pivot landmarks are estimated with errors. The results demonstrate that our method is more robust for both landmark and angle estimation on X-rays.

This study has limitations. The LCE-Net uses a multi-task network to estimate landmarks, which requires more labeled information. For the same reason, the LCE-Net increases the computational cost, although only modestly: the computation time of the developed system is \(0.16\pm 0.005\) in the test stage. From a practical perspective, our method can meet the time and cost requirements for integration into clinicians’ workflows.

6 Conclusion and future studies

In this paper, we first observe that existing methods for the Cobb angle estimation on X-rays use the segmentation and landmark information separately. To use the combined information, we propose a multi-task network that takes segmentation as an auxiliary task to estimate landmarks. It achieves higher performance than existing methods on landmark estimation. In addition, to avoid a large angle error caused by a small landmark error, we propose a Cobb angle estimation network that estimates the Cobb angle from the spinal curvature described by 68 landmarks instead of calculating it from pivot landmarks by rules.

As future work, we plan to analyze whether our methods can be applied to 3-D images or to combinations of X-rays taken from different directions. Future studies will also explore whether our methods can be used to estimate other clinical parameters based on spinal curvature.