1 Introduction

Adolescent idiopathic scoliosis (AIS) is a three-dimensional spinal condition, characterized by lateral curvature and axial vertebral rotation. It affects approximately 3% of adolescents [1]. The Cobb angle, which is the gold standard for quantifying spinal curvature, is measured on posteroanterior (PA) radiographs by identifying the angle between the upper end plate of the superior vertebra and the lower end plate of the inferior vertebra [2], shown in Fig. 1a. To decide which treatment regimen is appropriate, the Cobb angle is measured at a scoliosis clinic. If the measured Cobb angle has increased by more than 6° between two consecutive visits, it indicates that the curve has progressed, and treatment may be recommended to stop the curve progression. Therefore, it is essential to have accurate measurements which ensure that patients receive appropriate treatment in a timely manner. However, manual measurements introduce inter-observer and intra-observer variations and require time and effort to measure in the clinic [3].

Fig. 1 a A Cobb angle measurement on a posteroanterior radiograph; b a standard X-ray radiograph; c a low-dose radiation EOS radiograph; d pre-processed version of the EOS radiograph from c with segmented spinal column (yellow) and midline (blue)

Machine learning is a rapidly developing field that has successfully tackled problems that previously challenged traditional computing methods, by loosely imitating how humans learn from examples. This allows software to solve more general and wide-ranging problems that do not have a specific set of outputs. One such problem in computer vision is image segmentation, which has been transformed by the introduction of the convolutional neural network (CNN). These networks pass filters over an image in successive layers to identify and mark distinguishing features [4]. Particularly successful is the U-net from Ronneberger et al., a CNN architecture specifically designed for segmenting images with small training datasets [5]. This makes the U-net well suited to medical imaging applications with limited training samples.

Recently, Horng et al. developed a fully automatic program that used image processing to isolate the spinal column and a U-net CNN to segment the individual vertebrae. Using the vertebra segmentations, they measured the Cobb angle with an inter-method reliability of 0.94–0.97 between manual and automatic measurements. However, their test set comprised only 35 images and contained only mild Cobb angles of less than 20°, which is a small subset of overall cases [6]. Fu et al. used a two-network architecture to estimate landmarks and then calculated the Cobb angle from these landmarks [7]. They achieved a mean absolute difference (MAD) of 3.15° on 240 radiographs. However, the distribution of Cobb angle severities in their test set is unclear. Other groups have reported their own automated methods, but these do not achieve results comparable to those of these two papers [8,9,10,11,12].

The objectives of this paper were to report a new quality-based iterative CNN algorithm that automatically measures Cobb angles on PA radiographs with a wide range of curve severities and to determine the accuracy of this method by comparing automatic measurements with clinical records. A fully automatic and accurate Cobb angle measurement algorithm would reduce measurement variation and clinical workload, thereby improving diagnostic accuracy and streamlining clinical workflow. Measurement accuracy matters because it directly affects treatment decisions.

2 Materials and Methods

2.1 Data

From a local scoliosis clinic, over 600 PA radiographs of children with AIS were available for use. Among these radiographs, 300 were acquired using the standard X-ray system (between 2010 and 2015), while the remainder were acquired using the newer EOS X-ray imaging system (between 2015 and 2020), which uses an ultra-low dose of radiation. Figure 1b and c compare the image quality of the radiographs obtained from the standard and EOS X-ray systems, respectively. The average size of a PA radiograph obtained from the standard and EOS X-ray systems is 4820 × 2335 and 3465 × 1600 pixels, respectively. Of the available PA radiographs, 238 were randomly selected and divided into three groups (120, 18, and 100 images). The first group of 120 PA images was used for spinal column segmentation, of which 110 were used for training and 10 for validation. The second group of 18 PA images was used to generate 282 vertebra images, which were used for training (272 images) and validation (10 images) of vertebra segmentation. The remaining 100 images were used to test the accuracy of the automatic Cobb angle measurement. Ethics approval (Pro00102044) was granted by the University of Alberta Health Research Ethics Board.

2.2 Proposed Method

The automatic Cobb angle measurement algorithm comprised two steps: (a) segmentation of the spinal column and (b) segmentation of the individual vertebrae. The spinal column segmentation aimed to identify the region starting from the top thoracic vertebra (T1) to the bottom lumbar vertebra (L5) on a PA radiograph. Using the spinal column as a guide, the individual vertebrae were located with an iterative algorithm and subsequently segmented. The CNNs used for segmentation were developed using Python with the TensorFlow library and were trained on a Linux virtual machine hosted on the Industry Sandbox & AI Computing (ISAIC) supercomputer, using an NVIDIA Tesla V100 16 GB GPU and an Intel Xeon Gold 6138 dual processor. All images related to CNN development were labelled manually by the primary author using a MATLAB segmentation graphical user interface. Before labelling the training images, the primary author, a novice scoliosis researcher, practised on 10 extra images and confirmed the labels with a senior author who had 20 years of scoliosis research experience.

2.2.1 Spinal Column Pre-processing

To limit the extraneous information passed to the CNN, a region of interest (ROI) centered on the spinal column was identified and cropped from each PA radiograph. First, the head was located, as it was positioned directly above the T1 vertebra. Because the image intensity of the head was higher than that of its immediate surroundings, the maximum intensity vertical projection of the top 50 rows of the radiograph was calculated. The region of highest intensity in this projection was set as the ROI’s center, and the ROI was extended 500 pixels to either side, giving a total width of 1000 pixels. Contrast limited adaptive histogram equalization (CLAHE) was applied to the ROI to accentuate the spinal column boundaries. The cropped and histogram-equalized images were then resized to 256 × 128 pixels to standardize the input image size for the CNN. These pre-processed images, along with their labels, were used to train the CNN for spinal column segmentation. Labelling the spinal column involved annotating a ground truth semantic segmentation of the whole spinal column from T1 to L5, similar to the yellow highlighted area in Fig. 1d.
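The ROI search described above can be illustrated with a minimal sketch. It assumes the radiograph is a 2-D NumPy array with rows running top to bottom, and it takes the single brightest column of the projection as the center, which is a simplification of "the region of highest intensity". The function name `crop_spine_roi` and its parameters are hypothetical, and CLAHE and resizing are not shown.

```python
import numpy as np

def crop_spine_roi(image, top_rows=50, half_width=500):
    """Crop a vertical strip centred on the brightest column of the top rows.

    Assumption: the head appears as the brightest structure in the top rows.
    """
    # Maximum-intensity projection of the top rows onto the horizontal axis
    projection = image[:top_rows].max(axis=0)
    # Simplification: use the single brightest column as the ROI centre
    center = int(np.argmax(projection))
    # Clamp the strip so that it stays inside the image
    left = max(0, center - half_width)
    right = min(image.shape[1], center + half_width)
    return image[:, left:right], (left, right)
```

In practice, smoothing the projection before taking the maximum would likely make the center estimate less sensitive to isolated bright pixels.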

2.2.2 Spinal Column Segmentation

A CNN with a U-net architecture was used to segment the spinal column from the pre-processed images [5]. The architecture, which is similar to that of Horng et al. [6], is illustrated in Fig. 2. The spinal column CNN was trained for 400 epochs using a binary cross entropy loss function and an Adam optimizer with a learning rate of 10⁻⁴ [13]. A dropout probability of 0.1 was used after each downsampling and upsampling layer, and a batch size of 1 was used.

Fig. 2 Network architecture for the spinal column segmentation network. The vertebra segmentation network varies only in terms of input image size

A total of 120 PA radiographs were manually labelled and split into a 110-image training set and a 10-image validation set. The training set consisted of half standard and half EOS radiographs, while the validation set contained only EOS radiographs. Random horizontal flips and rotations (≤ 10°) were applied as data augmentation to increase the effective training set size and thereby the robustness of the CNN.

The segmentations underwent post-processing to optimize performance in the upcoming iterative vertebra location algorithm. This consisted of keeping only the largest connected component of the segmentation and estimating the spinal column shape by fitting a tenth-degree polynomial to the segmentation. This polynomial was denoted as the “midline” and is illustrated in Fig. 1d, along with the pre-processed radiograph and spinal column segmentation.
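As an illustration of the midline estimate, the sketch below fits a tenth-degree polynomial to the row-wise centers of a binary spinal column mask. Fitting to row centroids is an assumption, since the text does not state exactly which points were fitted, and the mask is assumed to already contain only its largest connected component. The name `fit_midline` is hypothetical.

```python
import numpy as np

def fit_midline(mask, degree=10):
    """Fit a polynomial x = f(row) through the row-wise centres of a binary mask.

    Assumption: the mask holds a single connected spinal column region.
    """
    # Rows that contain at least one segmented pixel
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.arange(mask.shape[1])
    # Mean column index of the segmented pixels in each row
    centers = np.array([cols[mask[r] > 0].mean() for r in rows])
    # Polynomial.fit rescales the rows internally, which keeps a
    # tenth-degree fit numerically well conditioned
    poly = np.polynomial.Polynomial.fit(rows, centers, degree)
    return rows, poly(rows)
```

The derivative of the fitted polynomial at a given row then gives a local midline angle that can serve as a reference for the vertebral tilt angles.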

2.2.3 Vertebra Pre-processing

Square images of the vertebrae were manually cropped from the PA radiographs and labelled to train a CNN for automatic vertebra segmentation. Like the spinal column images, CLAHE was applied to these images to highlight the vertebral boundaries. The pre-processed images were then resized to 128 × 128 pixels. Figure 3a illustrates a pre-processed cropped vertebra image. These pre-processed images and their respective labels were used to train the CNN for vertebra segmentation. Labelling the vertebrae involved annotating a ground truth semantic segmentation of each vertebral body, similar to the yellow highlighted area in Fig. 3b.

Fig. 3 a Cropped region of vertebra; b segmented vertebra; c segmented vertebra with bounding box

2.2.4 Vertebra Segmentation

Vertebra segmentation was accomplished using the same U-net architecture shown in Fig. 2, except the input size of the images was 128 × 128 pixels. It was trained for 200 epochs using a binary cross entropy loss function and an Adam optimizer with a learning rate of 10⁻⁴. No dropout was performed and a batch size of 1 was used.

A total of 282 vertebra images were manually labelled and split into a 272-image training set and a 10-image validation set. Both the training and validation sets comprised half standard and half EOS radiographs. To increase the effective training set size, random horizontal flipping, rotations (≤ 45°), zooms (80–120%), and horizontal and vertical shifts (≤ 10%) were employed as data augmentation methods.

Like the spinal column, only the largest connected component of the vertebra segmentation was kept. A minimum bounding box was placed around each vertebra segmentation to determine its tilt angle relative to horizontal. Figure 3b and c display a segmented vertebra and that segmented vertebra with a bounding box, respectively. Further methods of ensuring high quality vertebra segmentations are detailed in the iterative vertebra location sub-section.
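One way to obtain a tilt angle from a minimum bounding box is sketched below. It is a brute-force search over candidate angles rather than the routine the authors used, which is not specified; the function name, the search range of ±45°, and the 0.5° step are assumptions. Angles are expressed in image coordinates (x to the right, y downwards).

```python
import numpy as np

def min_area_box_tilt(mask, step_deg=0.5):
    """Return the tilt (degrees) of the smallest-area bounding box of a mask.

    Assumption: the vertebra is tilted by less than 45 degrees from horizontal.
    """
    ys, xs = np.nonzero(mask)
    pts = np.column_stack([xs, ys]).astype(float)
    pts -= pts.mean(axis=0)
    best_angle, best_area = 0.0, np.inf
    for angle in np.arange(-45.0, 45.0, step_deg):
        t = np.deg2rad(angle)
        c, s = np.cos(t), np.sin(t)
        # Project the points onto axes rotated by `angle`; a box tilted
        # by `angle` is axis-aligned in these coordinates
        u = pts[:, 0] * c + pts[:, 1] * s
        v = -pts[:, 0] * s + pts[:, 1] * c
        area = (u.max() - u.min()) * (v.max() - v.min())
        if area < best_area:
            best_angle, best_area = float(angle), area
    return best_angle
```

The angular resolution is limited by `step_deg` and by pixelation of the mask, so the result should be read as an estimate to within roughly a degree.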

2.2.5 Iterative Vertebra Location

To find all the individual vertebrae within a spine, an iterative algorithm was employed. To identify the first vertebra, five images were cropped around two-thirds of the way down the spinal column segmentation, where the clearest vertebra (T12) is normally present. The images were separated vertically by a quarter of the width of the spinal column segmentation, to be certain that at least one centered image of the vertebra was obtained. Each of these images was segmented by the CNN and the segmentation with the best quality was selected. A new region was then cropped above the initial segmentation at a distance equal to the height of the initial segmentation, with a width of about the spinal column segmentation at the new height. If the quality of this segmentation was good, this step was repeated, resulting in segmentation of the next vertebra upwards. If the quality of the segmentation was poor, the cropping window was increased in height or width, and the vertebra segmentation was attempted again. Quality was deemed poor if the quality variance or mean (Eqs. 1 and 2) exceeded a value of 100 or 10, respectively. In this manner, vertebra segmentation continued upwards until it reached the top of the spinal column segmentation. Finally, segmentation was performed from the initially segmented vertebra downwards until L5.
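The retry step of the iterative search can be sketched as follows. The thresholds of 10 (mean) and 100 (variance) come from the text; everything else is an assumption made for illustration, including the callable `segment`, the window representation, the growth factor, the alternation between height and width, and the retry limit.

```python
def is_poor(quality_mean, quality_var, max_mean=10.0, max_var=100.0):
    """A segmentation is poor if either quality value exceeds its threshold."""
    return quality_mean > max_mean or quality_var > max_var

def segment_with_retries(segment, window, max_retries=3, grow=1.2):
    """Segment one vertebra, enlarging the crop window after a poor result.

    `segment` is a hypothetical callable: window -> (mask, q_mean, q_var),
    where window = (center_x, center_y, width, height).
    """
    for attempt in range(max_retries + 1):
        mask, q_mean, q_var = segment(window)
        if not is_poor(q_mean, q_var):
            return mask, window
        cx, cy, w, h = window
        # Assumed schedule: grow the height first, then the width, alternating
        if attempt % 2:
            window = (cx, cy, w * grow, h)
        else:
            window = (cx, cy, w, h * grow)
    # No acceptable segmentation was found within the retry limit
    return None, window
```

Repeating this step while moving the window one vertebral height at a time, first upwards and then downwards from the starting vertebra, gives the overall traversal described above.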

The quality of a predicted vertebra segmentation was determined automatically by comparing it to several standard vertebra masks from the training images and then calculating an original quality coefficient inspired by Al Arif et al. and their shape-aware parameter [14]. First, the standard masks were rotated to match the angle of the predicted segmentation. Then, the outlines of the predicted mask and each standard mask were compared by finding the distribution of the minimum distances between each point on the predicted mask contour and the standard mask contour. Figure 4a shows an example of the distribution of minimum distances of a predicted mask with a poorly matched standard mask and Fig. 4b shows that distribution with a well-matched mask. The segmentation quality was determined by both the mean and variance of this distribution—the lower these two values, the higher the quality of the segmentation. The lowest mean and variance with one of the standard masks was taken as the quality metric for the predicted segmentation. Several standard masks were used during comparison because the contours of thoracic vertebrae are different from those of lumbar vertebrae. Equations 1 and 2 detail the formulae for the quality measurements:

$$\text{Quality Variance}=\min_{i}\left\{\mathrm{var}\left(x_{i}\right)\right\} \tag{1}$$

$$\text{Quality Mean}=\min_{i}\left\{\mathrm{mean}\left(x_{i}\right)\right\} \tag{2}$$

where $x_i$ is the distribution of minimum distances between each point on the predicted mask and the $i$th standard mask.
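Equations 1 and 2 can be illustrated with a short sketch. It assumes each contour is an N × 2 NumPy array of (x, y) points and that the standard contours have already been rotated and positioned to match the predicted one; the function names are hypothetical.

```python
import numpy as np

def min_distances(pred_contour, std_contour):
    """Distance from each predicted contour point to its nearest standard point."""
    diff = pred_contour[:, None, :] - std_contour[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def quality(pred_contour, standard_contours):
    """Return (quality mean, quality variance) as defined in Eqs. 1 and 2."""
    dists = [min_distances(pred_contour, s) for s in standard_contours]
    # The minima are taken independently, so the mean and the variance
    # may come from different standard masks
    quality_mean = min(d.mean() for d in dists)
    quality_var = min(d.var() for d in dists)
    return quality_mean, quality_var
```

A predicted contour that coincides with one of the standard contours yields a mean and variance of zero, and both values grow as the predicted shape departs from every standard shape.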

Fig. 4 Distribution of minimum distances between a predicted and standard mask for: a a poor match and b a good match

2.2.6 Cobb Angle Measurement

When measuring the Cobb angle, any vertebrae with angles that were significantly different from the spinal column midline angle at the vertebra location were discarded, and then angles significantly different from the angles of their neighboring vertebrae were discarded. The steepest remaining pair of opposing angles was then used to measure the Cobb angle for each curve containing one apex. An apex occurred wherever the centroids of the vertebra segmentations reached an extremum. The pseudocode for the algorithm is displayed in Table 1.
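The final measurement step can be sketched as below, under one possible reading of the description. The sketch assumes the outlier vertebrae have already been discarded, that `xs` holds the lateral centroid positions and `tilts` the vertebral tilt angles in degrees (both ordered from top to bottom), and that each curve extends from its apex to the neighbouring apexes or to the ends of the spine. The function names and this curve delimitation are assumptions, not the authors' implementation.

```python
def find_apexes(xs):
    """Indices where the lateral centroid position is a strict local extremum."""
    return [i for i in range(1, len(xs) - 1)
            if (xs[i] - xs[i - 1]) * (xs[i + 1] - xs[i]) < 0]

def cobb_angles(xs, tilts):
    """Return (angle, upper_index, lower_index) for each curve with one apex."""
    apexes = find_apexes(xs)
    bounds = [0] + apexes + [len(xs) - 1]
    results = []
    for k, apex in enumerate(apexes, start=1):
        upper = range(bounds[k - 1], apex + 1)   # candidates above the apex
        lower = range(apex, bounds[k + 1] + 1)   # candidates below the apex
        candidates = []
        # Try both tilt directions and keep the steeper opposing pair
        for sign in (1, -1):
            u = max(upper, key=lambda i: sign * tilts[i])
            l = min(lower, key=lambda i: sign * tilts[i])
            candidates.append((sign * (tilts[u] - tilts[l]), u, l))
        results.append(max(candidates))
    return results
```

For each curve, the Cobb angle is the difference between the two most steeply opposed tilt angles on either side of the apex, and the returned indices identify the upper and lower end vertebrae.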

Table 1 Pseudocode for the complete iterative algorithm

2.3 Validation

2.3.1 Network Segmentation

For both spinal column and vertebra segmentation networks, the 10-image validation sets were used to evaluate their performance. The mean accuracy and mean Dice coefficient were used as metrics. The ratio of validation to training images was small, but success of the networks was more strongly evaluated on how well they automatically measured Cobb angles. The spinal column training and validation sets had average Cobb angles of 24.5° ± 13.2° (range 6°–97°) and 25.6° ± 10.8° (range 8°–43°), respectively.
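The two segmentation metrics can be computed as in the following sketch, which assumes binary masks stored as NumPy arrays and interprets accuracy as the fraction of pixels labelled correctly. The function names are hypothetical.

```python
import numpy as np

def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float((pred.astype(bool) == truth.astype(bool)).mean())

def dice(pred, truth):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    # Two empty masks are treated as a perfect match (a convention)
    return 1.0 if denom == 0 else float(2.0 * (pred & truth).sum() / denom)
```

Accuracy counts background pixels as well, so it tends to be higher than the Dice coefficient when the segmented structure occupies a small part of the image.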

2.3.2 Cobb Angle Measurement

The automatic CNN-based algorithm was run on a new test set of 100 radiographs, and the difference between each automatic and manual Cobb angle measurement was found. All the manual measurements were performed by a clinician with over 20 years of experience and the variation was within 4°. In the 100 test radiographs, 177 Cobb angles were manually measured with an average Cobb angle of 24.8° ± 10.1° (range 9°–52°). The mean absolute difference (MAD), standard deviation of absolute differences (SD), Pearson correlation coefficient (r), and percentage of differences within clinical acceptance (≤ 5°) were reported.
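The reported agreement statistics can be computed from paired measurements as sketched below. The use of the sample standard deviation (n − 1 denominator) is an assumption, as the text does not specify it, and the function name is hypothetical.

```python
import math

def agreement_stats(auto, manual, tolerance=5.0):
    """Return (MAD, SD of absolute differences, Pearson r, % within tolerance)."""
    diffs = [abs(a - m) for a, m in zip(auto, manual)]
    n = len(diffs)
    mad = sum(diffs) / n
    # Assumption: sample standard deviation with an n - 1 denominator
    sd = math.sqrt(sum((d - mad) ** 2 for d in diffs) / (n - 1))
    mean_a, mean_m = sum(auto) / n, sum(manual) / n
    cov = sum((a - mean_a) * (m - mean_m) for a, m in zip(auto, manual))
    var_a = sum((a - mean_a) ** 2 for a in auto)
    var_m = sum((m - mean_m) ** 2 for m in manual)
    r = cov / math.sqrt(var_a * var_m)
    within = 100.0 * sum(d <= tolerance for d in diffs) / n
    return mad, sd, r, within
```

The function expects two equal-length sequences of Cobb angles in degrees, one automatic and one manual, paired curve by curve.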

3 Results

Figure 5a shows a test image with all vertebrae segmented, and Fig. 5b shows the final result of the automatic algorithm. In this example, the algorithm detected 3 curves: T3–T5, T5–T12, and T12–L4. However, the clinician’s measurement only had two curves: T5–T12 and T12–L4 as the slope of T3 was small (< 1°) and therefore not considered as part of a curve.

Fig. 5 a Results of the iterative algorithm to find individual vertebrae on a test image; b bounding boxes on the vertebrae used for Cobb angle measurement

3.1 Network Segmentation

Training the spinal column and vertebra segmentation networks took 51 min and 9 min, respectively. The spinal column CNN achieved an accuracy of 0.989 and a Dice coefficient of 0.954, and the vertebra CNN achieved an accuracy of 0.945 and a Dice coefficient of 0.910, on their respective 10-image validation sets.

3.2 Cobb Angle Measurement

The automatic method reported 173/177 matching measurements, with 88% within clinical acceptance. The 4 missing curves were a result of vertebra segmentation failure, which was related to image quality or curve ambiguity. Among the 173 comparisons, 87% of the upper and lower vertebrae and 89% of the apical vertebrae from the curves were within ± 1 vertebral level of the manual measurements. This minor variation is usually accepted in clinical assessment. Figure 6 shows a histogram of the signed differences between automatic and manual Cobb angle measurements, with a maximum absolute difference of 16°. The MAD ± SD of the 173 comparisons was 2.8° ± 2.8°, and the manual and automatic measurements were strongly correlated (r = 0.989). Table 2 summarizes all relevant segmentation and angle measurement results. The automatic algorithm took 90 ± 41 s per image, which is comparable to manual measurements (90 s per image).

Fig. 6 Distribution of automatic and manual Cobb angle measurement differences

Table 2 Summary of results for segmentation and automatic Cobb angle measurement

4 Discussion

4.1 Spinal Column and Vertebra Segmentation

Overall, both the vertebra and spinal column segmentation networks performed well given the relatively small number of training samples and the complexity of the task, although there are still a few challenges. The vertebra segmentation network appears to have learned to mark the segmentation edge wherever there are lines or high contrast regions. This is not always desirable, as vertebrae occasionally have extraneous bright lines or regions in the middle of the vertebra or around the vertebral edge, leading to false edge detection. Consequently, the bounding box could be incorrect and result in a poor Cobb angle measurement. Furthermore, since the algorithm is iterative, a poor segmentation on the current vertebra may induce improper segmentation of the following vertebrae.

To improve vertebra segmentation, several factors can be changed. First, different preprocessing can be performed. While CLAHE is a standard way to improve contrast and feature extraction, another preprocessing algorithm may help reveal appropriate features, particularly by highlighting the vertebral edges without simply increasing contrast everywhere. Second, a custom loss function can be developed to train the network more efficiently, such as by including a shape-aware parameter [14] or incorporating a Hausdorff distance metric [15].

4.2 Cobb Angle Measurement

Compared with other published methods, the developed algorithm performed well. Horng et al. tested their algorithm on 35 images and compared it with two sets of manual measurements by an expert rater [6]. They had 91% within clinical acceptance on one set and 97% on the other, both higher than the 88% demonstrated by our algorithm. However, all of their curves were under 20°. If our test set is limited to curves under 20°, 93% of the automatic measurements are within 5° of the manual measurements, indicating comparable performance in terms of clinical acceptability. Calculating the MAD ± SD from the measurements provided by Horng et al., their algorithm returned 3.0° ± 2.0° and 2.5° ± 1.7° between their automatic and two sets of manual measurements. Our results for curves under 20° improved upon this, with a MAD ± SD of 2.3° ± 2.1°. The most significant improvement of the developed method is that it can be accurately applied to a wide range of curve severities. This is important because brace treatment and surgery are recommended for curves greater than 25° and 45°, respectively.

Although the results are good, there is still room for improvement. As mentioned before, a poor vertebra segmentation could affect the Cobb angle measurement. The current algorithm only uses the angles of the spinal column midline and neighboring vertebrae as references to eliminate any outlying vertebral tilt angles from poor segmentations; however, this method does not always succeed. Further development can focus on a more robust algorithm that removes erroneous tilt angles more reliably; one option is to consider the angle between the tops of the pedicles as an alternative to the bounding box angle. Clinicians sometimes use this angle instead of the vertebral endplates if the vertebral boundaries are unclear. If the pedicles are segmented, their angle can be used as additional information when calculating vertebral tilt for the Cobb angle.

5 Conclusion

This paper reported a robust iterative method based on convolutional neural networks to automatically measure the Cobb angles on PA radiographs for children with AIS. The MAD ± SD between manual and automatic measurements was 2.8° ± 2.8°, and 88% of automatic measurements were within 5°, which is a clinically acceptable error. Accuracy and measurement time comparable to those of manual measurements were achieved automatically, and thus the proposed method could be a valuable tool for reducing measurement variation and clinical workload. Further improvements can be made by increasing the accuracy of the vertebra segmentation, especially on the vertebrae which contribute to a Cobb angle measurement.