1 Introduction

Adolescent idiopathic scoliosis (AIS) is a common form of scoliosis affecting adolescents. Scoliosis is a medical condition characterized by a lateral curvature of the spine. The curves are three-dimensional and are usually described as “S”- or “C”-shaped. In some patients the condition is not stable, and the curvature of the spine increases over time. Mild scoliosis usually causes no problems, but severe cases can lead to heart and lung complications. An accurate quantitative assessment of spinal curvature is therefore essential for the clinical evaluation and treatment planning of AIS [1].

Anterior-posterior radiographs of the spine are a standard means of analyzing AIS. Given the growing volume of imaging examinations and the complexity of their assessment, there is an urgent need for advanced computerized methods that support physicians in diagnosis, therapy planning, and the guidance of interventional procedures. Machine learning algorithms can improve the speed and accuracy of treatment planning by making the images easier to interpret. In particular, they are used to segment radiographic images, which supports faster diagnosis, better interpretability, and noise removal. Segmentation divides an image into regions with similar characteristics. Deep convolutional neural networks (CNNs) are important for medical image analysis because they extract features automatically. U-Net is a deep CNN whose architecture allows it to segment small data sets without requiring complex hardware. Building on these advantages, this article presents a modified U-Net architecture and a combined loss function that improve on the conventional U-Net in extracting spinal vertebrae, in terms of edge detection, handling of overlapping vertebrae, and segmentation accuracy.

The remainder of the article is organized as follows: Sect. 2 reviews the literature, Sect. 3 describes the general framework and the proposed method, Sect. 4 describes the dataset and the experimental results, and Sect. 5 discusses the findings and concludes.

2 Literature Review

Recent advances in deep learning have demonstrated the advantages of automatic spinal curvature assessment. Tavana et al. studied classical machine learning methods and pre-trained deep neural networks for classifying the type of spinal curvature [2], and presented an ensemble voting approach to improve this classification [3]. Wu et al. [4, 5], Sun et al. [6], and Galbusera et al. [7] extracted vertebral landmarks and Cobb angles directly from spine images using one-stage bottom-up approaches [8]. Zhong et al. used deep learning to detect the Cobb angle automatically from X-ray CT images [1]. Wang et al. first designed a segmentation network that accurately segments the two spine boundaries; the resulting score map is then used, with the original X-ray images of the spine, as input to a second angle estimation network that predicts the Cobb angle with high precision by regression [9]. Wang et al. [10] used a multi-view extrapolation network to predict the Cobb angle directly. Nicolaes et al. [11, 12] used 3D convolutional networks to diagnose vertebral fractures from CT scans. Lin et al. segmented the spine in radiographic images and estimated its curvature using a deep neural network called Seg4Reg [13]. Khanal et al. used two deep neural networks (Faster RCNN and Dense-Net) to identify vertebra and landmark locations and then fitted the landmark curves to estimate the Cobb angle [14].

Because segmenting the spinal vertebrae is necessary for better interpretation of radiographic images and helps doctors plan more effective and accurate treatment, this study presents an improved U-Net architecture together with a combined loss function for more accurate segmentation of the spinal vertebrae.

3 General Framework

Segmentation is an intermediate step in image analysis [15] that divides an image into parts that correlate strongly with the region of interest (ROI) within the image [9, 16]. In medical image segmentation [17], the aim is to represent a given input image in a meaningful manner that allows for the study of anatomy, the identification of the ROI, and the development of treatment plans. Medical image segmentation assists in the analysis of medical images by highlighting the ROI within the image [18].

Segmenting medical images is a challenging task because of the way the images are acquired, the type of pathology, and various biological variations [15]. Analyzing medical images requires a medical imaging specialist, but there are few of these specialists [19]. In recent years, deep learning networks have contributed to the development of newer image segmentation models with improved performance. In addition, deep neural networks are highly accurate in segmenting popular datasets [18]. Segmentation assigns a class label to each pixel within an image: whereas image classification assigns a single label to the whole image, image segmentation assigns a class label to every pixel. As a result, segmentation is prone to class imbalance problems. Classes with a large number of pixels can achieve high accuracy, while classes with a small number of pixels are segmented less accurately [20]. U-Net [21] is a well-known convolutional neural network for the segmentation of medical images. It combines a symmetric contracting path with an expansive path, which results in a more efficient and faster segmentation process. Moreover, it has been widely used for various special problems and has proven to be very effective in segmenting data [22]. In this paper, an improved architecture of the U-Net network and a combination of the dice loss function and the weighted cross-entropy loss function are presented to improve the performance of the conventional U-Net network. Figure 1 shows the general framework of this paper.

Fig. 1 The general framework of the proposed method

3.1 The Proposed Modified U-Net Network

In this paper, an improved U-Net is presented that focuses on segmenting spinal vertebrae. The architecture has been modified and expanded so that it requires very few training images and provides more accurate segmentation. The radiographic images are fed to the network as input, and the segmentation of the vertebrae of the spine is obtained at its output. In the right corner, the most important details about the U-Net network are indicated. As with the original U-Net, this network consists of a contracting path (left) and an expansive path (right), as well as a bottleneck [21].

Unlike the original U-Net, however, this study applies batch normalization after each convolutional layer, as well as a dropout layer with a rate of 0.2 after each convolutional block on the contracting path and before each convolutional block on the expansive path. Batch normalization normalizes the outputs of the convolutional layers to have a mean of zero and a standard deviation of one, and the dropout layer deactivates some neurons of the hidden layer to prevent the network from overfitting. The number of filters in each convolutional block has also been modified to make the network more efficient: the first convolutional block has 32 filters, and this number doubles in each of the following four blocks, reaching 512 at the end. Figure 2 illustrates the modified network architecture in detail.

Fig. 2 Proposed architecture for the modified U-Net
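The two regularization layers described above can be illustrated with a few lines of plain Python. This is a minimal sketch, not the authors' Keras code: the function names are hypothetical, the learned scale and shift parameters that framework batch-normalization layers also carry are omitted, and the dropout shown is the common "inverted" variant that rescales the surviving activations during training.

```python
import math
import random


def batch_normalize(values, eps=1e-5):
    """Shift and scale a batch of activations to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # eps (an assumed small constant) keeps the division stable for tiny variances.
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


def dropout(values, rate=0.2, rng=None):
    """Zero each activation with probability `rate`; rescale the rest by 1 / (1 - rate)."""
    rng = rng if rng is not None else random
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_normalize(activations)
dropped = dropout(normalized, rate=0.2, rng=random.Random(0))
print(normalized)
print(dropped)
```

With a rate of 0.2, roughly one activation in five is zeroed at each training step, and the rescaling keeps the expected activation unchanged.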

3.1.1 Down-Sampling or Contracting Path

This path is composed of five blocks. The down-sampling path consists of:

• 2 × (Convolution Layer (3 × 3) with ReLU activation function and batch normalization).

• Max Pooling (2 × 2) and Drop out layer (0.2).

A contracting path is used to capture the semantics or context of the input image for its segmentation. By using convolutional and pooling layers, it extracts features that describe what is in an image.

There is a bottleneck between the expanding and contracting paths of the network. The bottleneck consists of two convolutional layers with batch normalization.

3.1.2 Expanding or Up-Sampling Path

The expanding path, also known as the decoder, consists of five blocks. In the up-sampling path, the following elements are present.

• Deconvolution layer with stride 2 and drop out layer (0.2).

• Concatenation with the corresponding copied feature map from the contracting path.

• 2 × (Convolution Layer (3 × 3) with ReLU activation function and batch normalization).

• Convolution layer (1 × 1) at the end of the expansive path.

The expanding path restores the feature map size and adds spatial information for the segmentation image by using up-convolution layers. Through skip connections, coarse contextual information from the contracting path is transferred to the up-sampling path.

It is possible to maintain the dimensions of the image by adding zero padding layers. The zero padding method adds zero rows and columns to the input matrix to control the size of the output feature map [23].
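The bookkeeping of filters and feature-map sizes along the two paths can be traced with a small calculator. The sketch below is illustrative only: the 256 × 256 input size and all function names are assumptions that do not come from the paper, and it models only the sizes (five contracting blocks starting at 32 filters, 3 × 3 convolutions with one-pixel zero padding, 2 × 2 pooling, stride-2 deconvolutions), not the layers themselves.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def contracting_path(input_size, first_filters=32, blocks=5):
    """Return (filters, size before pooling, size after pooling) for each block."""
    trace = []
    size, filters = input_size, first_filters
    for _ in range(blocks):
        # Two zero-padded 3x3 convolutions leave the spatial size unchanged.
        size = conv_output_size(conv_output_size(size))
        pooled = size // 2  # 2x2 max pooling halves each spatial dimension
        trace.append((filters, size, pooled))
        size, filters = pooled, filters * 2
    return trace


def expanding_path(bottleneck_size, blocks=5):
    """Spatial size after each stride-2 deconvolution, which doubles the size."""
    sizes = []
    size = bottleneck_size
    for _ in range(blocks):
        size *= 2
        sizes.append(size)
    return sizes


trace = contracting_path(256)  # 256 is an assumed input size
for filters, size, pooled in trace:
    print(filters, size, pooled)
print(expanding_path(trace[-1][2]))
```

Under these assumptions the filter count runs 32, 64, 128, 256, 512 while the feature map shrinks from 256 to 8 pixels per side, and five stride-2 deconvolutions bring it back to 256. Without zero padding (`padding=0`), each 3 × 3 convolution would instead remove two pixels per side-length.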

3.2 Loss Functions

In neural network training, the cost function plays a crucial role in adjusting the weights of a neural network to create a better-fitting machine learning model. In feedforward propagation, the neural network is run on training set data, and outputs are generated in the case of classification, indicating the probability or confidence in possible labels. By comparing these probabilities to the target labels, the loss function calculates a penalty for any deviation between the target label and the neural network’s output. The partial derivative of the loss function is calculated for each trainable weight during backpropagation. These partial derivatives are used to adjust the weights. Under normal conditions, backpropagation iteratively adjusts the trainable weights of a neural network to produce a model with a lower loss [24].

For segmentation, weighted cross-entropy loss is a loss function that classifies each pixel in an image, adding additional weight to adjust the importance of positive labels. The dice and intersection over union (IOU) losses, in contrast, are calculated from a ratio between the prediction result and the ground truth, which provides a measure of the overlap between the two. That is, they evaluate the prediction over the entire image. Different loss functions can therefore be applied to the U-Net network to assess segmentation results from different perspectives, and it is expected that the output of the modified U-Net network in Fig. 2 will improve when loss functions are combined. The modified U-Net is accordingly trained with each of these loss functions so that their outputs can be compared.

The model is built on the U-Net network and is capable of segmenting the spine. As part of the study, the following loss functions have been explored for the improved U-Net network: binary cross-entropy, weighted binary cross-entropy (WCE), dice loss, IOU loss, and finally the combination of weighted binary cross-entropy and dice loss.

3.2.1 Binary Cross Entropy

Generally, cross-entropy [25] measures the difference between two probability distributions for a given random variable or set of events. It is widely used for classification and, since segmentation is pixel-level classification, for segmentation as well [26].

$${L}_{BCE}\left(y\text{,}\widehat{y}\right)=-\frac{1}{n}\sum _{i=1}^{n}({y}_{i}\text{log}\left({\widehat{y}}_{i}\right)+\left(1-{y}_{i}\right)\text{log}\left(1-{\widehat{y}}_{i}\right))$$
(1)

The sum runs over the \(n\) pixels of the image, where \(i\) is the position of the pixel, \({y}_{i}\) is its ground-truth value, and \({\widehat{y}}_{i}\) is its predicted value [27].
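Equation 1 can be evaluated directly on a flattened mask. The following is a minimal pure-Python sketch rather than the training code used in the study; the function name and the clipping constant `eps`, which keeps the logarithm finite when a prediction is exactly 0 or 1, are assumptions.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy (Eq. 1) over flattened pixels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never receives 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# One vertebra pixel and one background pixel, both predicted with probability 0.5.
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

An uninformative prediction of 0.5 for every pixel gives a loss of log 2, and the loss approaches zero as the predictions approach the ground truth.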

3.2.2 Weighted Binary Cross-Entropy

Weighted binary cross-entropy (WCE) [28] is a variant of binary cross-entropy in which positive examples are weighted by a coefficient. It is commonly used in cases of skewed data [24]. Weighted cross-entropy is defined as follows:

$${L}_{WBCE}\left(y\text{,}\widehat{y}\right)=-\frac{1}{n}\sum _{i=1}^{n}\left(\beta {y}_{i}\text{log}\left({\widehat{y}}_{i}\right)+\left(1-\beta \right)\left(1-{y}_{i}\right)\text{log}\left(1-{\widehat{y}}_{i}\right)\right)$$
(2)

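The balance between false negatives and false positives can be tuned with \(\beta\): to reduce the number of false negatives, set \(\beta\) > 1; to reduce the number of false positives, set \(\beta\) < 1 [26]. These thresholds presuppose an unweighted background term. In the form of Eq. 2, where the background term is weighted by \(1-\beta\), \(\beta\) is restricted to the interval (0, 1), and a larger \(\beta\) shifts the penalty toward false negatives.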
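The effect of \(\beta\) in Eq. 2 can be checked numerically. The sketch below implements Eq. 2 exactly as written, with weights \(\beta\) and \(1-\beta\); the function name, the clipping constant, and the example probabilities are assumptions, and \(\beta = 0.75\) is the value reported in Sect. 4.4.

```python
import math


def weighted_bce(y_true, y_pred, beta, eps=1e-7):
    """Weighted binary cross-entropy as in Eq. 2 (weights beta and 1 - beta)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("Eq. 2 needs 0 < beta < 1 so that both weights stay positive")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never receives 0
        total += beta * y * math.log(p) + (1 - beta) * (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# A vertebra pixel predicted as background (false negative) versus
# a background pixel predicted as vertebra (false positive).
missed_vertebra = weighted_bce([1], [0.1], beta=0.75)
false_alarm = weighted_bce([0], [0.9], beta=0.75)
print(missed_vertebra, false_alarm)
```

With \(\beta = 0.5\) the function reduces to half of the unweighted binary cross-entropy of Eq. 1, which is a convenient sanity check for an implementation.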

3.2.3 Intersection-Over Union (IOU)

According to [29], the IOU loss can solve the imbalance between the two classes (foreground and background) in the segmentation problem. Its function \({L}_{IOU}\) is defined by:

$${L}_{IOU}=1-\frac{\sum _{i=1}^{N}\sum _{j=1}^{C}{y}_{i,j}{\widehat{y}}_{i,j}+\epsilon }{\sum _{i=1}^{N}\sum _{j=1}^{C}\left({y}_{i,j}+{\widehat{y}}_{i,j}-{y}_{i,j}{\widehat{y}}_{i,j}\right)+\epsilon }$$
(3)

Here, N is the number of pixels, C is the number of classes, and \(\epsilon\) is a smoothing constant that prevents the denominator from being zero.
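For a single foreground class, Eq. 3 reduces to a ratio of two sums over the pixels. The following is a minimal sketch under that single-class assumption; the function name and the value of the smoothing constant are illustrative and are not taken from the paper.

```python
def iou_loss(y_true, y_pred, eps=1e-6):
    """Soft IOU loss (Eq. 3) for one class over flattened pixels."""
    intersection = sum(y * p for y, p in zip(y_true, y_pred))
    union = sum(y + p - y * p for y, p in zip(y_true, y_pred))
    return 1.0 - (intersection + eps) / (union + eps)


truth = [1, 1, 0, 0]
prediction = [1, 0, 1, 0]  # one hit, one miss, one false alarm
print(iou_loss(truth, prediction))
```

In this example the intersection is 1 pixel and the union is 3 pixels, so the loss is close to 2/3. When both masks are empty, the smoothing constant makes the ratio equal to 1 and the loss equal to 0 instead of producing a division by zero.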

3.2.4 Dice Loss

The dice loss is proposed in reference [30] to solve the medical image segmentation problem where the foreground occupies only a small region of the background. It is defined by:

$${L}_{Dice}=1-\frac{2\sum _{i=1}^{N}\sum _{j=1}^{C}{y}_{i,j}{\widehat{y}}_{i,j}+\epsilon }{\sum _{i=1}^{N}\sum _{j=1}^{C}\left({y}_{i,j}+{\widehat{y}}_{i,j}\right)+\epsilon }$$
(4)
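Equation 4 can be sketched in the same way as the IOU loss. The code below again assumes a single foreground class and flattened masks; the function name and the smoothing value are hypothetical.

```python
def dice_loss(y_true, y_pred, eps=1e-6):
    """Soft Dice loss (Eq. 4) for one class over flattened pixels."""
    intersection = sum(y * p for y, p in zip(y_true, y_pred))
    total = sum(y + p for y, p in zip(y_true, y_pred))
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


truth = [1, 1, 0, 0]
prediction = [1, 0, 1, 0]  # one hit, one miss, one false alarm
print(dice_loss(truth, prediction))
```

For this prediction the Dice loss is close to 0.5, whereas Eq. 3 gives roughly 0.67 for the same masks, so the same errors are penalized less by the Dice loss than by the IOU loss.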

3.2.5 Combo Loss

Because dice loss alone is unsuitable for small diffuse structures, some studies have combined a distribution-based loss with a region-based loss for their segmentation. Combo loss is a weighted sum of the dice and WCE terms that benefits from both [31, 32], and it is defined as follows:

$${L}_{WDSC}=\alpha {L}_{WCE}-(1-\alpha )DSC$$
(5)
$${L}_{WDSC}=\alpha \left(-\frac{1}{n}\sum _{i=1}^{n}\left[\beta {y}_{i}\text{log}\left({\widehat{y}}_{i}\right)+\left(1-\beta \right)\left(1-{y}_{i}\right)\text{log}\left(1-{\widehat{y}}_{i}\right)\right]\right)-(1-\alpha )\frac{2\sum _{i=1}^{N}\sum _{j=1}^{C}{y}_{i,j}{\widehat{y}}_{i,j}+\epsilon }{\sum _{i=1}^{N}\sum _{j=1}^{C}\left({y}_{i,j}+{\widehat{y}}_{i,j}\right)+\epsilon }$$
(6)

In formula 5, \({L}_{WCE}\) is the weighted cross-entropy loss of Eq. 2 and \(DSC\) is the Dice similarity coefficient, i.e., the fraction in Eq. 4, so that \(DSC=1-{L}_{Dice}\). The hyper-parameter \(\alpha\) controls the relative contributions of the weighted cross-entropy term and the dice term.
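The expanded form in Eq. 6 can be written as a short function. This is a sketch under stated assumptions rather than the authors' implementation: it handles a single foreground class on flattened masks, the function name and both small constants are hypothetical, and the defaults \(\alpha = 0.5\) and \(\beta = 0.75\) are the values reported in Sect. 4.4. Following Eq. 6, the second term is the Dice overlap ratio itself, entered with a negative sign.

```python
import math


def combo_loss(y_true, y_pred, alpha=0.5, beta=0.75, eps=1e-6, clip=1e-7):
    """Combo loss of Eq. 6: alpha * weighted cross-entropy - (1 - alpha) * Dice ratio."""
    wce = 0.0
    intersection = 0.0
    total = 0.0
    for y, p in zip(y_true, y_pred):
        q = min(max(p, clip), 1.0 - clip)  # clipped copy used only inside log()
        wce -= beta * y * math.log(q) + (1 - beta) * (1 - y) * math.log(1 - q)
        intersection += y * p
        total += y + p
    wce /= len(y_true)
    dice_ratio = (2.0 * intersection + eps) / (total + eps)
    return alpha * wce - (1 - alpha) * dice_ratio


truth = [1, 1, 0, 0, 0, 0]
print(combo_loss(truth, [0.9, 0.8, 0.1, 0.1, 0.2, 0.1]))  # good prediction
print(combo_loss(truth, [0.4, 0.3, 0.6, 0.5, 0.4, 0.5]))  # poor prediction
```

Because the Dice ratio is subtracted, the loss is bounded below by \(-(1-\alpha)\), which it approaches for a perfect prediction; only differences in the loss matter for training, so the negative values are harmless.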

Spinal vertebral segmentation is also a pixel classification problem. The cross-entropy loss term evaluates each pixel individually, and the weighted cross-entropy loss additionally weights each pixel according to its class. Vertebrae usually occupy a small area in anterior-posterior radiographs. Thus, a segmentation network trained using a cross-entropy loss function is biased toward the background rather than the vertebrae. Like dice loss, combo loss is capable of handling input class imbalances, such as segmenting vertebrae from a background. Furthermore, the weighted cross-entropy term penalizes the network for false positives and false negatives, forcing it to learn better parameters. Experimental results show that the combo loss function is more robust than the weighted cross-entropy loss function and the dice loss function [33].

4 Experimental Results

This section presents the experiments and evaluation techniques used to test the performance of the proposed model. The model was implemented with the open-source Keras deep learning framework, using TensorFlow as the backend. All the experiments were conducted on an HP Pavilion Power laptop with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80 GHz processor. The remaining hardware specifications of the laptop are listed in Table 1.

Table 1 Hardware specifications of the computer used for training

4.1 Experimental Dataset

The spine data set is available at http://spineweb.digitalimaginggroup.ca/ [5] and has been used for segmentation. There are 609 anterior-posterior radiographic images in this data set. The four corners of each vertebra were marked by two health professionals at the London Health Sciences Center. Each radiographic image contains 17 vertebrae, each defined by its four corners. The data set is divided into training and testing sections, with 481 images for the former and 128 images for the latter (Fig. 3).

Fig. 3 An example anterior-posterior radiograph of the spine from the dataset, containing 17 vertebrae whose positions are determined by the four corners of each vertebra (white points)

4.2 Pre-Processing and Data Augmentation

Each image must be pre-processed to suit the deep neural network. Resizing and normalization are the two crucial steps in this process. Because the architecture of the network is fixed, the radiographic images must match its input specification; for the modified U-Net network, each image is therefore resized and normalized to the network standard.

The training of a neural network requires a large amount of data. With a small and limited dataset, the parameters are poorly learned and the trained network does not generalize well. Data augmentation partially addresses this problem by generating new samples from the existing data. The dataset includes 481 training images and 128 testing images. Data augmentation settings for the images are shown in Table 2. After augmentation, 102400 images were obtained.

Table 2 Data augmentation techniques used in the proposed method
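One practical point when augmenting segmentation data is that every geometric transform must be applied identically to the radiograph and to its ground-truth mask, otherwise the labels no longer line up with the pixels. The sketch below illustrates this with a horizontal flip on small nested lists; the flip is a hypothetical example and is not necessarily one of the settings in Table 2, and the function names are assumptions.

```python
def flip_horizontal(grid):
    """Mirror a 2-D list of pixel values left to right."""
    return [list(reversed(row)) for row in grid]


def augment_pair(image, mask):
    """Return the original pair plus a mirrored copy, flipping image and mask together."""
    return [(image, mask), (flip_horizontal(image), flip_horizontal(mask))]


image = [[1, 2, 3],
         [4, 5, 6]]
mask = [[0, 1, 1],
        [0, 0, 1]]
for augmented_image, augmented_mask in augment_pair(image, mask):
    print(augmented_image, augmented_mask)
```

Intensity-only transforms such as brightness changes, by contrast, would be applied to the image alone, since they do not move any pixel.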

4.3 Label Setting

Since the learning process is supervised, image segmentation requires a label or ground truth. Two specialized physicians determined the four corners of each spine vertebra (white points in Fig. 4) in this dataset (for 17 vertebrae). The label of each vertebra is derived from its four corners. For the modified U-Net network, information other than the vertebrae is removed, leaving only the position of the vertebrae visible in the ground truth image (the area of the spine vertebrae is marked in white, while everything outside is marked in black).
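Turning four corner points into a filled ground-truth region can be sketched with a point-in-polygon test. The code below is an illustrative pure-Python version, not the labeling tool used in the study: it assumes that each vertebra is a convex quadrilateral whose corners are listed in order around its outline (the ordering in the dataset files may differ and would need to be rearranged first), and the function names are hypothetical.

```python
def quad_mask(height, width, corners):
    """Binary mask (1 inside, 0 outside) for one convex quadrilateral.

    corners: four (x, y) points listed in order around the outline.
    """
    mask = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            crosses = []
            for k in range(4):
                x1, y1 = corners[k]
                x2, y2 = corners[(k + 1) % 4]
                # Sign of the cross product tells on which side of the edge the pixel lies.
                crosses.append((x2 - x1) * (row - y1) - (y2 - y1) * (col - x1))
            # Inside a convex polygon, the pixel lies on the same side of every edge.
            if all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses):
                mask[row][col] = 1
    return mask


def combine_masks(masks):
    """Pixel-wise maximum of several masks, one per vertebra."""
    return [[max(values) for values in zip(*rows)] for rows in zip(*masks)]


vertebra = quad_mask(5, 5, [(1, 1), (3, 1), (3, 3), (1, 3)])
for row in vertebra:
    print(row)
```

A full label image would combine 17 such masks, one per vertebra, with `combine_masks`, giving white (1) inside the vertebrae and black (0) elsewhere.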

Fig. 4 Anterior-posterior radiograph of the spine and the labeling of these images in the proposed network

4.4 Hyper-Parameter Tuning

This network architecture is based on the original U-Net architecture. In this research, additional batch normalization and drop-out layers have been added to the network architecture, and the number of filters in each convolutional block has been changed. It is therefore necessary to train the network from scratch, using the input images and their segmentation masks. Several experiments were conducted during the training process by tuning the hyper-parameters of the network.

Hyper-parameters of deep learning models, such as the learning rate and the number of filters, have to be selected; their optimal values were determined by grid search to optimize the results on the validation set (e.g., one round of cross-validation). The best result was obtained with equal contributions (i.e., \(\alpha =0.5\)) of the dice and weighted cross-entropy terms. False positives need to be penalized more for the model to better detect vertebral boundaries and intervertebral distances (i.e., \(\beta =0.75\)). In addition, the batch size is 64, the number of epochs is 50, and the drop-out rate is 0.2; Adam optimization with a learning rate of 0.001 was used. Table 3 presents all the details of the proposed network architecture.

Table 3 The tuning of hyper-parameters for the proposed network architecture

4.5 Performance Metrics

An output image is generated, separating vertebral areas from the background for each input image. The network output is also compared with a ground truth image.

The first evaluation criterion is the overall accuracy, which indicates how well the output image produced by the network agrees with the diagnosis of a radiologist:

$$Accuracy=\frac{TP+TN}{N}$$
(7)

In Eq. 7, TP is the number of correctly detected vertebra pixels (i.e., pixels that belong to the vertebrae of the spine both in the neural network’s output and in the ground truth image). TN represents the number of correctly detected non-vertebral pixels, and N indicates the total number of pixels in the input image.

Since an image contains many more non-vertebral pixels than vertebral pixels, the accuracy criterion alone can be misleading in this case. Precision, recall, and the Dice similarity coefficient (DSC) are therefore used as well.

$$Precision=\frac{TP}{TP+FP}$$
(8)
$$Recall=\frac{TP}{TP+FN}$$
(9)
$$DSC=\frac{2TP}{2TP+FP+FN}$$
(10)

In Eqs. 8, 9, and 10, TP represents the number of correctly detected vertebra pixels. FP is the number of non-vertebra pixels that were wrongly detected as vertebrae (i.e., in the neural network output the pixels are vertebrae, whereas in the ground truth they are non-vertebrae). FN represents the number of vertebra pixels that failed to be identified (that is, in the neural network output the pixels are non-vertebrae, whereas in the ground truth they are vertebrae).
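Equations 7 to 10 all follow from the four pixel counts. The following sketch computes them for flattened binary masks; the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the paper does not specify.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN over flattened binary masks (1 = vertebra)."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Division that returns 0.0 for an empty denominator (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and DSC as defined in Eqs. 7 to 10."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "dsc": safe_div(2 * tp, 2 * tp + fp + fn),
    }


truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_background = [0] * 10  # a network that never predicts a vertebra
print(segmentation_metrics(truth, all_background))
```

The example shows why accuracy alone is not sufficient: predicting background everywhere still reaches an accuracy of 0.8 on this toy mask, while recall and DSC are both 0.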

4.6 Results

As part of the training process, parameters and hyper-parameters were tuned to test and evaluate the performance of the proposed network. As with the training dataset, the test dataset is pre-processed and normalized. Figure 5 shows the outputs obtained with each of the loss functions. All the images in Fig. 5 have the same size.

Fig. 5 Input, ground truth, and output of the improved U-Net network for vertebrae segmentation with each of the loss functions: (a) input, (b) ground truth, (c) output with the binary cross-entropy loss function, (d) output with the weighted cross-entropy loss function, (e) output with the IOU loss function, (f) output with the Dice loss function, (g) output with the WDSC (proposed) loss function

For each of the loss functions, Table 4 reports the accuracy, precision, recall, and Dice similarity coefficient of the improved U-Net network for spine vertebrae segmentation and compares this network with conventional U-Net [21], MultiResNet [34], Pre-trained Mask-RCNN101 [35], U-Net++ [36], and Dense-U-net [37].

Table 4 Comparing the improved U-net with conventional U-net, pre-trained Mask-RCNN, MultiResNet, U-Net++, and dense U-net using the different loss functions for segmentation of spinal vertebrae

As shown in Fig. 5 and Table 4, the improved U-Net network has a better architecture for spine vertebrae segmentation than the conventional U-Net [21], MultiResNet [34], Pre-trained Mask-RCNN101 [35], U-Net++ [36], and Dense U-Net [37] networks. These gains are attributed to the architectural changes made to segment the spine vertebrae: the number of filters in each convolution block is changed, the number of convolutional blocks is increased, batch normalization is applied to the convolutional blocks, and drop-out layers are added after the max-pooling and up-convolution layers. In addition, the combo loss function (\({L}_{WDSC}\)), which combines the dice loss function and the weighted cross-entropy loss function through the \(\alpha\) value, has a significant effect on performance and on accurately detecting the vertebrae’s position, because false positives and false negatives can be tuned with the \(\beta\) value.

5 Discussion and Conclusion

Spinal abnormalities are increasing significantly as a result of changes in people’s lifestyles. Scoliosis is a spinal deformity characterized by an abnormal structure in the spine. Radiographic images are the gold standard for diagnosing spine abnormalities. The large number of patients, the time-consuming examinations, and the small number of doctors call for the use of machine learning algorithms to help doctors and speed up evaluations. This article uses an improved U-Net neural network to better interpret radiographic images and improves the segmentation results by pairing the improved U-Net with a combined loss function.

With the IOU loss function, false positive predictions incur a lower error than false negative predictions. The probability of a false positive prediction is higher for a category with few pixels, so the network might settle for false negatives. The Dice loss penalizes false negatives and false positives less than the IOU loss does, and the difference between its penalties for false positives and false negatives is smaller. Accordingly, the imbalance in probabilities is compensated slightly, leading to improved performance. The two-class formulation penalizes all errors from both perspectives, resulting in a smaller difference between false negatives and false positives. Additionally, the weighted cross-entropy loss classifies the individual pixels of an image and adjusts the importance of positive labels through an additional weight. Combining the Dice loss function with weighted cross-entropy has a positive effect on the network’s final results, and the network can readily extract the spine vertebrae for the doctor to evaluate.