1 Introduction

The health issue is essential in today's world with approximately 8 billion individuals. A healthy musculoskeletal structure and the absence of any abnormalities in it is a critically important indicator of health (Foley 2013). The effects of musculoskeletal disorders on an individual's appearance and beauty may result in mental and emotional issues. These abnormalities may lead to heart and lung disease and slight imbalances for the patient over time, causing joint erosion and muscle cramps that result in pain, decreased spinal mobility, nervous system disorders, and asymmetry. These disorders often result in a limp on both sides of the body and may even lead to death. Therefore, doctors need to diagnose this disease as early as possible.

Adolescent Idiopathic Scoliosis (AIS) is one of the spinal disorders. Generally, this condition is characterized by a disorder in the spine structure, resulting in an abnormal body weight distribution. It is one of the most common abnormalities and deformities of the spine in the world (Driscoll et al. 2009; Han et al. 2015; Komeili et al. 2014). There are two types of scoliosis: C-shaped and S-shaped. Scoliosis C-shape occurs when the spine bends in one direction with a C-shaped curve, resulting in a C-shaped curve throughout the spine (Fig. 1). Scoliosis of this type can manifest in several ways: 1. the lumbar curve occurs in the lower back. 2. The thoracolumbar curve begins at the upper area of the waist and ends at the lower area of the waist. 3. The thoracic curve is the type of curvature scoliosis that occurs in the upper part of the spine.

Fig. 1
figure 1

The anterior–posterior images of the spine with a C-shaped curve (a, b) and the anterior–posterior images of the spine with an S-shaped curve (c, d)

In S-shaped scoliosis, the spine curves in two places—one at the top of the chest and the other at the waist. Interestingly, when these curves are in the opposite direction, they make the spine resemble the letter S (Fig. 1). Since these curves may balance each other at different times, diagnosing types of scoliosis may be more challenging (Fig. 2).

Fig. 2
figure 2

It is challenging to determine the type and severity of spinal curvature from scoliosis X-rays due to high ambiguity and variability (aL)

In addition, the diagnosis of the type of spinal curvature is often influenced by many factors such as the selection of vertebrae, observer bias, and image quality. When evaluating the type and progression of scoliosis, such variables have a significant impact on the diagnosis. It can also be difficult to detect the severity and type of curvature due to the high level of X-rays-related ambiguity and variability. As shown in Fig. 2, large anatomical variability and low tissue contrast may make identifying the interesting vertebrae difficult. Accordingly, it may be difficult to detect the type of curvature in the spine, and the patient may eventually go through a more complicated stage of the disease, which will require braces as well as a more complex surgical procedure. As a result, determining the type of abnormality and its severity is very critical both for the doctor and the patient (Chen et al. 2019).

Therefore, depending on the type, shape and severity of the deviation (C-shaped or S-shaped), sports techniques, the type of brace, and, in severe cases, surgery specific to the same kind of deviation are considered. Consequently, diagnosing the type of spinal curvature plays a significant role in developing spinal evaluation systems that will assist the physician in formulating a treatment plan for the scoliosis patient.

Radiograph images, spiral CTs, and MRIs of the spine may be used to examine the spinal structure. The gold standard for diagnosing this disease is currently a spinal radiograph images (Cherian et al. 2005; Franquet 2001).

The private dataset used in this study consists of two classes. Based on the type of curvature of the spine, this dataset has been classified by two specialists. There is a challenge in classifying X-ray images (radiographs) due to their visual similarity between the different classes. A set of features plays a significant role in image classification. The following section discusses a collection of various features that other researchers have used to classify medical images (A Mueen et al. 2007).

The classification of images is the process of assigning them to predefined classes. Machine learning algorithms can perform image classification in both a classical and an automatic method. There are some basic principles underlying the classical method, such as representation in which the visual feature of the image is extracted and used as input by the classifier model. In contrast, in automatic machine learning algorithms, such as deep neural networks, there is no need to extract features and represent images (Zare et al. 2013).

In recent years, researchers have developed automated algorithms to analyze radiographic images due to their complexity of interpretation. The first articles describing an automatic anomaly detection system on radiographic images were published in the 1960s (Bhattad and Jain 2020).

Deep learning has become the preferred technique for image analysis tasks in recent years, and it has made a tremendous impact in medical imaging. It is well known that deep learning is very data-hungry. Recent years have witnessed the release of numerous labeled radiographic image databases primarily produced by automated radiology report analysis. The radiographic imaging community has greatly benefited from such databases.

Among deep learning techniques, Convolutional Neural Networks (CNN) demonstrate to significantly improve image segmentation and classification and provide a rapid diagnosis for developing assessment systems to assist doctors.

Large amounts of labeled data are usually required to train Convolutional Deep Neural Network (DNN) models. Transfer learning has successfully applied deep neural networks to real-world problems with a small dataset that traditional machine learning cannot solve, including image processing (Niu et al. 2020; Ravishankar et al. 2016).

This paper aims to classify the types of spine curvature (C-shaped and S-shaped) using classical machine learning algorithms, namely SVM and KNN, and automatic methods of deep convolutional neural networks. Classical machine learning methods for image classification require feature extraction and image representation. To improve the performance of deep neural networks, SVM has been used here as the final activation function, which prevents the vanishing gradient. In addition, it aims to help doctors diagnose patients quickly and accurately with the minimum amount of radiation in the classification of spinal curvatures.

This paper is organized as follows: Sect. 2 reviews the related literature. Section 3 presents the proposed approach in more detail. Section 4.2 provides a review of methods and experiments and experimental datasets. Lastly, Sect. 5.2 is the conclusion section.

2 Literature review

The first component in any classical classification system is the representation of images. There are two main approaches to image representation: (i) low-level image representation and (ii) patch-based image representation (Zare et al. 2013).

Several researchers have employed low-level image representations in various applications (Chang-ming et al. 2008; Deselaers and Ney 2008; Dimitrovski et al. 2011; Kim et al. 2010; Ahmed Mueen et al. 2008; Mustaffa et al. 2008; Sohail et al. 2010; Tian et al. 2008; Tommasi et al. 2008). Kim et al. proposed a method for classifying medical X-ray images that use LBP to extract texture information from the x-ray images, finding that the classification accuracy obtained was better than that of others who used edge histogram descriptors (Kim et al. 2010). Combining a block-based local binary pattern with an edge histogram was used as a medical image representation for Image-CLEF 2007's task of automatic medical image annotation (Tian et al. 2008).

According to Chang-ming et al., the segmentation method was applied to ultrasound medical images based on the texture features obtained from the GLCM (Chang-ming et al. 2008). As a feature extraction method, GLCM was combined with histogram moments to classify ultrasound medical images (Sohail et al. 2010). However, combining different local and global descriptors into one unique feature representation can be seen as another widely used approach. As a global image descriptor, pixel values were combined with other image representation techniques to construct a feature vector to describe the image (Deselaers and Ney 2008; Dimitrovski et al. 2011; Ahmed Mueen et al. 2008; Tommasi et al. 2008; Zare et al. 2013). In several studies, image representations based on local patches have been studied. In the local patch-based image representation method, Bag of Words (BoW) is used to represent images as histograms of quantized appearances of local patches.

Several studies have utilized this feature in classifying medical images (Avni et al. 2010a, b; Deselaers et al. 2006). Because of the increasing size of medical X-ray archives, it is essential to have simple, discrete representations and simple matching methods to ensure computer efficiency. A visual word dictionary was generated to represent the X-ray chest images used in another paper (Avni et al. 2010a). Using a "bag of visual words" approach, (Avni et al. 2010b) proposed classifying and retrieving local patch representations-based X-ray images. The best performance was obtained using dense sampling of features and a kernel-based Support Vector Machine (SVM) based on nonlinear kernels. In Deselaers et al. (2006) extracted features based on local patches of different sizes, which were measured at every position and then scaled down to a common size (Zare et al. 2013).

Researchers and physicians have been considering deep neural networks as an automated machine learning system in recent years since they have the advantage of automatic feature extraction, which helps diagnose and classify diseases in medicine (Szegedy et al. 2015, 2016). Studies have been reported in the articles (Pratt et al. 2016; Shuangling Wang et al. 2015), diagnoses of diabetic retinopathy (DR) or ocular diabetes (Giancardo et al. 2012; Niemeijer et al. 2009; Rahim et al. 2016; Reddy et al. 2018; Vo and Verma 2016), in (Tajbakhsh et al. 2015b; Zhang et al. 2016) diagnoses polyps during colonoscopy, in (Tajbakhsh et al. 2015a) correctly diagnoses Pulmonary embolism in CT scan images, in (Zheng et al. 2015) classify lung diseases, in (Rajinikanth et al. 2022) lung pneumonitis is diagnosed with the InceptionV3 network (Rajinikanth et al. 2022) demonstrated the lung segmentation performance of the U-Net scheme with a one-fold and two-fold training process. Experimental verification of the proposed scheme is conducted by segmenting the lung section from chest X-rays in (Nagi et al. 2022). This research involves an evaluation of the performance of deep learning models for COVID-19 diagnosis using chest X-ray images from a dataset containing the largest number of COVID-19 images ever used in the literature, according to the best of the author’s knowledge. In (Mohammed et al. 2021) an automatic prediction of COVID-19 identification is proposed to automatically discriminate between healthy and COVID-19-infected subjects in X-ray images using two successful modern are traditional machine learning methods (e.g., artificial neural network (ANN), support vector machine (SVM), linear kernel and radial basis function (RBF), k-nearest neighbor (k-NN), Decision Tree (DT), and CN 2 rule inducer techniques) and deep learning models (e.g., MobileNets V2, ResNet50, GoogleNet, DarkNet, and Xception) and in (Dubost et al. 2019; Khanal et al. 2019; Lin et al. 2019; Wang et al. 2019a, b; Zhao et al. 2019) study spinal segmentation and measurement of Cobb angle by a deep neural network.

According to Lin et al., Seg4Reg is a framework that consists of two deep neural networks for segmenting the spine and measuring Cobb angles (Lin et al. 2019). Based on the deep neural network, Zhao et al. used a top-down approach to determine spine curvature automatically (Zhao et al. 2019). According to Khanal et al., an automated method using the Faster RCNN network detects vertebrae as objects, followed by a landmark detector using Dense-Net to estimate the four landmark corners of each vertebra separately. Cobb Angles were calculated using the slope of each vertebra obtained from the predicted landmarks (Khanal et al. 2019). Chen et al. presented a new adaptive error correction network (AEC-Net) to formulate the estimation of Cobb angles from spine X-ray images as a high-accuracy regression task. AEC-Net consists of two networks that calculate landmarks and Cobb angles separately, which robustly solves the disadvantage of ambiguity in X-rays since these networks focus on more features (Chen et al. 2019). Wang et al. proposed a multi-view extrapolation network (MVE-Net) that provides accurate automated scoliosis estimation in multi-view (both AP and LAT) X-rays (Wang et al. 2019a). According to the studies conducted for the automatic spine assessment articles, the only focus is on measuring the Cobb angles of the spine.

Studies indicate that there is no article regarding the classification of spinal curvature type (C-shaped and S-shaped). Due to the biased observer and the poor quality of the radiographic images, determining the type of curvature is a challenging task. Because there are few doctors, spinal deformity assessment takes considerable time, and as this disease progresses, the patient experiences more severe stages of the disease, accordingly. Thus, automated methods of diagnosing and categorizing the curvature type and severity of curvatures are becoming increasingly important to aid patients as quickly as possible to provide them with the best possible care.

Although there is an importance and necessity for the automatic detection of the type of curvature of the spine, no research has been conducted to date in this area. The objective of this study was to automatically detect the type of curvature using machine learning techniques for the first time. In this investigation, the traditional machine learning method, including support vector machines (SVM) and K-nearest neighbor (KNN) in front of the pre-trained deep networks, are used, studied, and compared.

The classical machine learning method first extracts the feature vector from the images, as shown in Fig. 3a. Following the training process, classification is performed. In contrast, in the deep learning method (Fig. 3b), the images are fed directly into the proposed pre-trained deep networks. After fine-tuning the networks, the image classification is carried out automatically.

Fig. 3
figure 3

A classification framework based on SVM and KNN machine learning algorithms and pre-trained networks

3 The proposed general approach

Automated image classification is an active field of research in pattern recognition and machine learning. An image classification method based on supervised learning is used here. There are two steps in the classification process: training and testing. It is well known that in supervised learning, a training set is given with a category label and a set of image features. The learning objective is to compute the classifier from the training set and create the classification model. During the test step, this model is used to classify new images that are not labeled to determine the model's accuracy (Mueen et al. 2007). Figure 3 illustrates the proposed classification training process.

3.1 The image processing module (image enhancement and feature extraction)

The training step in the Support Vector Machine (SVM) and the K-nearest neighbor (K-NN) algorithms includes two modules of image processing and model production. As shown in Fig. 3a, an image processing module consists of several components, including image enhancement and feature extraction. Image enhancement methods are used to improve the quality of the images. In classical machine learning, feature extraction methods are used to determine the appropriate model inputs.

The image classification process begins with extracting appropriate features from the images. To determine image classification accuracy, extracting a set of features from the image is primarily necessary. Adding more distinctive characteristics will improve the classification result. It is possible to extract features at three levels: global, local, and pixel. An image's most superficial visual properties are directly related to pixel values. As a result of scaling the images to a common size, the Euclidean distance and distortion model is used to compare them (Keysers et al. 2004). In this process, local features are extracted from small sub-images from the original image. The global features can be extracted to describe the whole image in an average way. Color, texture, and shape are the low-level features extracted from images and local patches (Ma and Zhang 1998; Mueen et al. 2007).

3.1.1 Image enhancement

Histogram equalization is one of the image enhancement techniques applied to improve the quality of the image, which in turn increases the contrast of the image. As a result of the contrast adjustment, the gray intensities distribution on the histogram is improved. This method is appropriate for images with bright or dark backgrounds and foregrounds. Notably, the technique can enhance the interpretation of X-ray images of bone structure (Ahmed Mueen et al. 2008; Zare et al. 2013).

3.1.2 Feature extraction

Feature extraction is the second component of the image processing module. It plays a crucial role in the performance of any image classification system since it can impact the classification results. However, since X-ray images are gray-level and do not contain color information, the related Content-Based Image Retrieval (CBIR) systems deal primarily with textures and shapes to extract features. This study uses various methods to extract features from images, including texture, shape, pixel value, and the bag of words, as illustrated in Fig. 4 (Zare et al. 2013).

Fig. 4
figure 4

Feature extraction methods for radiographic images

3.1.3 Texture features

Texture features can be used to distinguish categories of medical images with distinct textures. The texture contains information essential to comprehend the underlying structural arrangement of the surfaces within an image. In this study, the commonly used feature extraction techniques, Gray Level Co-occurrence Matrix (GLCM) (Haralick et al. 1973) and Local Binary Patterns (LBP) proposed by Ojala et al. (1996), were used to extract texture features (Zare et al. 2013).

As one of the well-known texture extraction techniques, GLCM is a technique that assesses second-order texture characteristics. In other words, the gray level coordinate matrix, or GLCM, of an N × N image, which consists of pixels with gray levels 0, 1, 2, … G − 1, is a matrix \(C(i\text{,} j)\) which depicts the probability of co-occurrence of intensity levels i and j at a given distance and angle θ. The four occurrence matrixes of GLCM are presented from four different directions \((\theta \epsilon \{{0}^{^\circ }\text{,} {90}^{^\circ }\text{,}{45}^{^\circ }\text{,}{135}^{^\circ }\})\) at a global level (Zare et al. 2013).

LBP is another important texture feature used in this study. The original LBP describes only global texture information. On the other hand, local features are more robust to occlusion and clutter. The image was divided into 4 × 4 or 2 × 2 non-overlapping sub-images to produce this result. Then, each sub-image was further divided into a non-overlapping square image block with the same size as the 'neighbor set' itself. The LBP distribution can be represented in each sub-image as a histogram by combining the local histograms for each square block. The final histogram of each sub-image is then generated by combining local histograms (Zare et al. 2013).

3.1.4 Shape features

The shape of an object in an image provides geometrical information which remains the same despite the location, scale, and orientation changes. Accordingly, the shape information of an image is described by the edges of the image. For each image and image patch, a histogram of the edge directions is used to represent the shape attribute. The Canny edge operator (Ojala et al. 1996) is used to generate edge histograms. The algorithm of the Canny edge detector is divided into five stages:

  1. 1.

    Smoothing: The process of removing noise using a Gaussian filter.

  2. 2.

    Finding gradients: Finding the gradient of the intensity of an image using finite-difference approximations for partial derivatives to calculate gradient magnitude and orientation.

  3. 3.

    Suppression of local maxima: Only local maxima should be marked as edges. The non-maxima suppression will be applied to the gradient magnitude to obtain these results.

  4. 4.

    Double thresholding: Potential edges are determined by the thresholding process.

  5. 5.

    Edge tracking by hysteresis: Final edges are determined by suppressing all edges that are not connected to a very specific (strong) edge (Zare et al. 2013).

3.1.5 Pixel value

The simplest method for representing images is the pixel value method. A feature vector contains the values of image pixels in an algorithm that scales down the image to a common size and represents it as a feature vector containing the image pixels. It is argued that image pixel values do not give satisfactory classification or retrieval results. When comparing two images, it is not always easy to determine which two pixels should be evaluated. Nevertheless, this method is a reasonable basis for classifying and retrieving simple images with few objects, such as medical X-rays and character recognition (Keysers et al. 2007; Zare et al. 2013).

3.1.6 Combining visual features

One thousand images (800 training images and 200 test images) exist in the dataset. The images have all been scaled down to 100 × 100 pixels. A global feature vector combines texture and shape features into a 53-dimensional feature vector (16 textures + 37 edges). In experiments two and three, local features are obtained by dividing the images into sixteen and four pieces, respectively, without overlapping. The original image has been resized to 15 × 15 for pixel-level information, and a feature set has been created with 225 features. The outcome of experiments 1, 2, and 3 have been 278, 944, and 750 dimension features, respectively (Fig. 5).

Fig. 5
figure 5

The three-level features-based classification (Mueen et al. 2007)

3.1.7 Bag of words

The process of BoW is initiated with the detection of local interest points. These point detectors should find specific points and regions in an image invariant to certain geometric and photometric transformations. One of the popular approaches for detecting local interest points, which was applied in this experiment, is the Difference of Gaussians (DoG). The DoG detector proposed by Lowe (1999) is designed to be invariant to translation, scale, rotation, and illumination changes and to be able to sample images from different locations and scales (Zare et al. 2013).

Secondly, some distinctive features are extracted that characterize an image's set of key points. For a bag of visual words to function properly, the next step is constructing a codebook, which requires that 128-dimensional local image features be quantized into discrete visible words. A clustering algorithm or a vector quantization algorithm is used for this task. Usually, the k-means clustering method is used in this step, along with the cluster center as a visual vocabulary term. Following the identification of cluster centers, each image is presented as a histogram of those cluster centers by simply counting the frequency with which the words appear in the image. Every feature vector within an image is assigned to a cluster center to accomplish this task by giving it a Euclidean distance to its nearest neighbor (Zare et al. 2013).

3.2 Model generation module

After the image processing module, the next module in the process is the model generation module to detect the type of curvature of the spine (C-shaped or S-shaped). The training set, the test set, and the label of each image are specified. After extracting the features from the training set, the extracted features are entered into the classifier along with the label of each image in the dataset to construct a classification model. The Support Vector Machine (SVM) and K-nearest neighbor have been classified better than other classification techniques based on experimental results and several applications in this field (Eljamassi and Maghari 2020; Hasoon et al. 2021; Zare et al. 2013). As a result, these two methods are used in this study to investigate the problem.

3.2.1 Support vector machine (SVM)

The SVM approach is exceptionally appealing for image classification because it produces a hyperplane when separating relevant and irrelevant vectors and thus maximizes the margin between them. Optimal hyperplanes (Fig. 6) have the most significant margin toward sample objects; the greater the margin, the less likely any feature vector will be misclassified (Wasule and Sonar 2017; Zare et al. 2013).

Fig. 6
figure 6

Optimal hyperplane margin (Wasule and Sonar 2017)

3.2.2 K-nearest neighbors

The KNN classifier is the simplest method to classify data in machine learning. A KNN classification is based on the majority vote among the neighboring classes expressed as k-nearest neighbors. Classification can be achieved by identifying the neighbors closest to the query example and using those neighbors to determine the class to which the query belongs (Govil and Chugh 2015).

3.3 Transfer learning module

The pre-trained mobileNetV2 and Pre-trained Xception modules are used in Fig. 3b. The MobileNetV2 and Xception are networks built using convolutional neural networks (CNNs). In recent years, CNN has achieved excellent results in object recognition and image classification, which is one of the most established algorithms among the various models of deep learning networks (Krizhevsky et al. 2012; Russakovsky et al. 2015).

3.3.1 Convolutional neural network (CNN)

Convolutional neural networks (CNNs) are a subset of artificial neural networks. Like traditional neural networks (TNNs), CNNs consist of neurons, layers, and weights. The most significant difference between CNNs and TNNs is that CNNs have unique ability to detect and classify images. That is, the input to deep neural networks can include information matrices with higher dimensions(Krizhevsky et al. 2012).

The CNN architecture has several building blocks, such as convolution, pooling, and fully connected layers. An architectural structure typical of the process uses a stack of several convolution layers and a pooling layer, followed by one or more fully connected layers. As a result of these layers, input data are transformed into output data, known as forward propagation (Fig. 7) (Yamashita et al. 2018).

Fig. 7
figure 7

A representation of the architecture and training process of a convolutional neural network (CNN) (Yamashita et al. 2018)

3.3.2 Training-based transfer learning

In medical imaging, it is desirable to have an abundance of well-labeled data, but it is not readily available because of the cost and the load of radiologists. Two techniques can be used to train a model on a smaller dataset: data augmentation and transfer learning. A common and effective strategy for training a network on a small dataset is transfer learning, which involves pre-training a network on a vast dataset such as the ImageNet dataset, which contains 1.4 million images with 1000 classes, and then reused to apply to a specific problem (Yamashita et al. 2018). While this method reduces the cost of training models, it also enhances the effectiveness of machine learning in a significant way. As a result, transfer learning can enable people to handle some new or different application scenarios when there is not enough labeled data available to allow machine learning to be performed (Xiang et al. 2019).

3.3.3 Pre-trained MobileNetV2

A group of Google researchers developed Mobile-Nets as a small and low-power model (Sandler et al. 2018). The MobileNetV2 is an advanced version of the MobileNetV1. It uses depth-wise separable convolutions as efficient building blocks. The architecture has two types of blocks: residual blocks with a stride of one, and downsizing blocks with a stride of two. In addition, it has linear bottlenecks between the layers, which should be avoided by using linear layers to prevent nonlinearities from damaging a large volume of data. These bottlenecks in the model can help by encoding intermediate inputs and outputs. As a result of the inner layer, lower-level concepts such as pixels are transformed into higher-level descriptors such as image categories. Additionally, bottlenecks are connected by shortcuts. As shown in Fig. 8, a pre-trained MobileNetV2 was used (Hashmi et al. 2020). In this network, SVM is used instead of the sigmoid activation function to improve the accuracy of the classification as well as prevent the vanishing gradient in the last layer of the network.

Fig. 8
figure 8

The MobileNetV2 architecture consists of bottlenecks and depth-separable convolutions (Hashmi et al. 2020)

3.3.4 Pre-trained Xception

Francois Chollet proposed the Xception model (Chollet 2017) which is an extension of the Inception architecture and uses modified depth-wise separable convolutions instead of the standard Inception modules, which makes it better for both the ImageNet ILSVRC and JFT datasets. This modified depth-wise separable convolution initiates with a pointwise convolution and ends with a depth-wise convolution. Res-Net architecture also proposes residual/skip connections (He et al. 2016). There are three blocks in the Xception architecture: entry flow, middle flow, and exit flow. In addition, eight repetitions of the middle flow block are available. A modified depth-wise separable convolution in the architecture has no intermediate non-linearity. As shown in Fig. 9, the pre-trained model of Xception has been used (Hashmi et al. 2020). To improve the performance and increase the accuracy of the network in the classification of spinal curvature types, SVM is used as the activation function of the last layer.

Fig. 9
figure 9

An Xception architecture comprises three blocks: the entry flow block, the middle flow block, and the exit flow block (Hashmi et al. 2020)

4 Materials and methods

In this section, several experiments have been presented that analyzed the performance of pre-trained mobileNetV2 and pre-trained Xception networks classification against SVM and KNN methods using feature extraction methods originating from related radiographic image classification articles (Mueen et al. 2007; Zare et al. 2013).

In this article, 1000 anterior–posterior radiographs of the spine with C-shaped and S-shaped scoliosis are used. This private dataset was collected from the Shafa Hospital in Tehran, Iran, and labeled by two medical specialists. In this dataset, the anterior–posterior images of the spine relate to different genders, age groups, and geographical locations who have visited the hospital with scoliosis in the last 20 years and represent the population of patients with scoliosis.

The radiographic images included in this dataset are anterior–posterior images of the spine, which include 17 vertebrae. Such images should be standardized and normalized before usage. For this purpose, the images are improved, which equalized the histogram and adjusted its resolution and size considering to the algorithm and network.

The dataset includes 500 anterior–posterior C-shaped radiographic images and 500 anterior–posterior S-shaped radiographic images (Table 1). Eighty percent of the dataset is the training set, and twenty percent is the test set. Figure 10 illustrates specimens of the private dataset, and Table 1 describes its characteristics.

Table 1 Describing the experimental private dataset
Fig. 10
figure 10

The specimens of the private dataset used to classify C-shaped and S-shaped curvatures (ad)

The purpose of this article is first to examine the accuracy of classifying spinal anomalies into C-shaped and S-shaped classes by SVM and KNN algorithms using different methods of image representation. Then, the performance of these classical algorithms is analyzed using the two pre-trained MobileNetV2 and pre-trained Xception networks. A general outline of the procedure is shown in Fig. 11.

Fig. 11
figure 11

A general description of the experiments in Experiments 1 to 4

4.1 The first experiment

A diagram of the first experiment is shown in Fig. 12. This experiment involves the following steps.

  1. 1.

    Equalization of histograms.

  2. 2.

    The texture (co-occurrence matrix) and shape features are extracted as global features from each image (16 textures + 37 edges).

  3. 3.

    The original image is resized to have a 15 × 15 pixel resolution to get the pixel vector.

  4. 4.

    All the above-extracted features are combined into one vector. The total dimensionality of the feature vector for this experiment reached 278 (225 + 53 = 278).

  5. 5.

    A SVM was tested using the Radial basic function (RBF) kernel and the KNN with K values of 3, 5, 7, 9, 11, 13, and 15, and the best results were obtained without the under-fitting or over-fitting model.

Fig. 12
figure 12

A schematic representation of the first experiment (Zare et al. 2013)

4.2 The second experiment

A diagram of the second experiment is shown in Fig 13. This experiment involves the following steps.

Fig. 13
figure 13

A schematic representation of the second experiment (Zare et al. 2013)

  1. 1.

    Equalization of histograms.

  2. 2.

    A local binary pattern (LBP) is extracted from each sub-region and then combined at the end. Since the image consists of 16 sub-images, 59 × 16 = 944 histogram bins are generated for the image containing the feature vector.

  3. 3.

    A SVM was tested using the RBF kernel and the KNN with K values of 3, 5, 7, 9, 11, 13, and 15, and the best results were obtained without the under-fitting or over-fitting model.

4.3 The third experiment 

A diagram of the third experiment is shown in Fig 14. This experiment involves the following steps.

Fig. 14
figure 14

A schematic representation of the third experiment (Zare et al. 2013)

  1. 1.

    Equalization of histograms.

  2. 2.

    The texture (co-occurrence matrix) and shape features are extracted as global features from each image (16 textures + 37 edges).

  3. 3.

    The original image is resized to have a 15 × 15 pixel resolution to get the pixel vector.

  4. 4.

    All of the above-extracted features are combined in one vector. As a result, the overall dimensions of the feature vector for this experiment were 278 (278 = 53 + 225).

  5. 5.

    Each image is divided into four non-overlapping sub-regions. The LBP is extracted from each sub-region.

  6. 6.

    Finally, adding the vectors to a single vector, the feature vector reaches 750.

  7. 7.

    A SVM was tested using the RBF kernel and the KNN with K values of 3, 5, 7, 9, 11, 13, and 15, and the best results were obtained without the under-fitting or over-fitting model.

4.4 The fourth experiment

The BoW is extracted from training images and fed into SVM and KNN classifiers to construct a classification model. Finally, A SVM was tested using the RBF kernel and the KNN with K values of 3, 5, 7, 9, 11, 13, and 15, and the best results were obtained without the under-fitting or over-fitting model.

4.5 The fifth experiment

The two pre-trained MobileNetV2 and pre-trained Xception networks are used in the fifth experiment to compare the accuracy and power of deep neural networks against classical machine learning methods. Figures 8 and 9 show a diagram of the procedure and details for the fifth experiment, and the steps for the experiment are as follows:

4.5.1 Pre-processing and data augmentation

For a pre-trained network to work, radiographic images are used as input. It is essential to pre-process each dataset according to the used deep neural network. Resizing and normalizing are two crucial steps in this field. These images must be resized to make them appropriate for the pre-trained MobileNetV2 and Xception networks. To enter the pre-trained MobileNetV2 network, images are first resized to 224 × 224, and to enter the pre-trained Xception network, images are resized to 299 × 299. The pixel intensity is transferred to the [0, 1] range and then normalized.

The training of a neural network requires a large dataset to achieve the desired results. With a small dataset, the parameters of the networks weaken, and the trained networks have a feeble generalization ability. Furthermore, noise may affect the accuracy of classification. Generally, data augmentation is the solution usually used to resolve this problem. For this purpose, data augmentation is used to increase the amount of training data and ensure robustness against noise attacks (Kim et al. 2021). In this dataset, five hundred images have been labeled as C-shaped, and five hundred images have been labeled as S-shaped. The data augmentation process produces 24,000 images. Table 2 below shows the settings used in data augmentation.

Table 2 The data augmentation technique used in the proposed method

Subsequently, the data is divided into training and test sets. About 80% of the data is contained in the training set, and the rest is included in the test set. The training is done on the training set, and the built model is evaluated on the test set (Table 1).

4.5.2 Fine-tuning of the networks

In Sects. 3 and 4.2, all details of the pre-trained MobileNetV2 and pre-trained Xception networks architecture are discussed in detail. This private anterior–posterior spine X-ray images dataset has been used to fine-tune the network after preprocessing and normalization. Then, the data augmentation method is used to improve the efficiency of the processing of the dataset. All the layers of the networks could be trained, and each layer extracted different visual features from the images for classification purposes. However, due to the small size of the dataset, the pre-trained mobileNetV2 and pre-trained Xception networks were used through transfer learning. Indeed, the pre-trained mobileNetV2 and pre-trained Xception networks were used on the image-Net dataset for classification purposes (Sandler et al. 2018).

Consequently, after preparing the data and transferring them to the networks at this point, only the last layer of the network is trained to classify the types of curvatures. Before network training begins, several hyper-parameters must be set. Kingma published an interesting paper (Kingma and Ba 2014), which indicates that the adaptive moment estimation method (Adam) performs better than other optimizers in generalization. Adam has been used as the optimizer, and this model has been trained for 50 epochs. Another hyper-parameter, 0.001, is required to control the learning rate (Table 3). For the last layer, SVM is used instead of the sigmoid activation function to avoid the vanishing gradient problem. To achieve the best accuracy, the networks must be fine-tuned to detect whether the spine has a C or S shape.

Table 3 Hyper-parameters used to fine-tune the model

4.6 Performance criteria for classification

After completing the training, the developed models should be evaluated using test datasets. The models' performance is assessed based on accuracy, recall, precision, and the F1 score. In this section, all the performance metrics that have been used in this paper are discussed. In the following definitions and equations, True-Positive (TP) represents the number of C-shaped images as C-shaped in classifying patients with C-shaped and S-shaped curvature. True-Negative (TN) indicates the number of S-shaped images specified as S-shaped. False-Positive (FP) represents the number of S images incorrectly known as C images, and False-Negative (FN) indicates the number of C-curved X-ray images that are incorrectly identified as S:

  • Accuracy: Accuracy is the ratio of the total number of correct predictions and the total number of predictions.

    $$Accuracy=\frac{(TP+TN)}{(TP+TN+FP+FN)}$$
    (1)
  • Precision: Precision is the ratio between the True Positives and all the Positives.

    $$Precision=\frac{TP}{(TP+FP)}$$
    (2)
  • Recall: The recall is the measure of the model correctly identifying True Positives.

    $$Recall=\frac{TP}{(TP+FN)}$$
    (3)
  • F1-score: F1-score is the harmonic mean of precision and recall

    $$F1=2\times \frac{(Precision\times Recall)}{(Precision+Recall)}$$
    (4)

5 Experimental results

To evaluate the performance and implementation of the machine learning algorithms, a standard computer with 8 GB of RAM, and a seventh-generation NVIDIA GeForce 940 MX processor was used. As the backend of the pre-trained MobileNetV2 and pre-trained Xception networks, the Keras Open Source Deep Learning Framework with Tensorflow is used.

The results of the analysis of C-shaped or S-shaped spinal deformities with classical machine learning methods, i.e. SVM and KNN and pre-trained MobileNetV2 and pre-trained Xception networks, are described in this section, and the results are compared.

5.1 The results of performance criteria

Parameters and hyper-parameters were set during the experimentation process. The test dataset images were pre-processed in the same manner as the training images.

The confusion matrix is also used to obtain the classification models of the experiments (Figs. 15, 16, and 17). The confusion matrix allows the number of correct and incorrect diagnoses for the C-shaped and the S-shaped curvature for Scoliosis to be analyzed for the desired models.

Fig. 15
figure 15

C-shaped and S-shaped spinal deformity classification confusion matrix results for the first, second, third, and fourth experiments using KNN

Fig. 16
figure 16

C-shaped and S-shaped spinal deformity classification confusion matrix results for the first, second, third, and fourth experiments using SVM

Fig. 17
figure 17

C-shaped and S-shaped spinal deformity classification confusion matrix results for pre-trained MobileNetV2 and pre-trained Xception

For purposes of assessing the efficacy of the methods, the accuracy, precision, recall, and F1 score for the SVM, KNN methods and pre-trained mobileNetV2 and pre-trained Xception networks for classifications of Scoliosis have been calculated (see Table 4 and Figs. 18, 19, 20, 21).

Table 4 An examination of accuracy, precision, recall, and F1 score using KNN, SVM, pre-trained MobileNetV2, and pre-trained Xception to classify spinal deformities based on recent research methods
Fig. 18
figure 18

Examining accuracy, precision, recall, and F1 score using the KNN method and pre-trained networks to detect the C-shaped curve of the spine

Fig. 19
figure 19

Examining accuracy, precision, recall, and F1 score using the KNN method and pre-trained networks to detect the S-shaped curve of the spine

Fig. 20
figure 20

Examining accuracy, precision, recall, and F1 score using the SVM method and pre-trained networks to detect the C-shaped curve of the spine

Fig. 21
figure 21

Examining accuracy, precision, recall, and F1 score using the SVM method and pre-trained networks to detect the S-shaped curve of the spine

A comparison of accuracy, precision, recall, and F1 score for C-shaped and S-shaped spine curvature classification based on classical machine learning methods, i.e. SVM and KNN, against the two pre-trained networks proposed by MobileNetV2 and Xception paid is presented in Figs. 18, 19, 20 and 21. As shown, the pre-trained networks significantly outperform classical machine learning methods due to automatic extraction.

In Table 4, the performance of the proposed methods is compared with those of radiographic image classification articles published in recent years.

Considering Table 4, pre-trained deep networks outperform classical machine learning methods due to their automatic feature extraction attribute. Furthermore, pre-trained Xception and pre-trained MobileNetV2 networks have better performance than the aforementioned methods of articles (Dansana et al. 2020; Minaee et al. 2020; Rahman et al. 2020) due to the use of depth-wise separable convolution and pointwise convolution.

Most commonly, the sigmoid function is used as the activation function of the last layer in binary classification; however, this function may suffer from vanishing gradient. The purpose of this article is to prevent this problem and improve the performance of networks by using SVM as the activation function of the last layer.

Based on the results of five experiments (Table 4 and Figs. 18, 19, 20, 21), pre-trained Xception performed better than the pre-trained MobileNetV2 network and classical SVM and KNN methods for the classification of spinal curvature types. After pre-trained Xception, pre-trained MobileNetV2, experiment 4, experiment 3, and finally experiments 2 and 1 are listed in order of accuracy. The results indicate that the SVM algorithm is more powerful than the KNN algorithm since the SVM algorithm is more accurate at detecting the type of spinal curvature.

According to Table 4, the proposed pre-trained networks with the SVM activation function have significant performance. Table 5 compares the classic machine learning algorithms (experiments 1 to 4) and the proposed pre-trained networks (experiment 5), highlighting the accuracy, computational complexity, the number of features extracted, and how the feature extraction was investigated in each experiment.

Table 5 The accuracy, time complexity, and feature extraction methods of classical machine learning algorithms and pre-trained neural networks

Table 5 shows that pre-trained networks perform significantly better than classical methods in terms of accuracy and speed. Similarly, due to the use of automatic feature extraction and data augmentation, they are more noise-resistant and robust. In addition, they have data augmentation for the problem of lack of data.

5.2 Analyzing the generalizability of the results

Having reviewed recent articles (Mohammed et al. 2021; Nagi et al. 2022; Rajinikanth et al. 2022), the lung pneumonia dataset in (Mohammed et al. 2021) was used to check the generalizability of the proposed deep pre-trained networks to other datasets and populations. There are 800 lung radiographs in this dataset (400 healthy cases and 400 COVID cases). In this dataset, 80% is considered for the train set and 20% for the test set, the results of which are presented in Table 6.

Table 6 Analyzing the generalizability of the proposed networks using lung pneumonia data

On the lung pneumonia dataset, the proposed pre-trained networks with SVM activation function perform well (Table 6).

5.3 Discussion

Due to its automatic feature extraction, deep neural networks can be more powerful than classical machine learning algorithms. Generally, deep convolutional neural networks (DCNNs) require large amounts of datasets to reach significant accuracy, and this training typically takes hours or even days. It is evident, however, that this paper's objective is the classification and analysis of the pre-trained mobileNetV2 and pre-trained Xcepion models and classic methods. Therefore, the parameters of the pre-trained Xception and pre-trained MobileNetV2 models can be shared via transfer learning with the new models and dataset to accelerate and optimize the training of the new model. Furthermore, Depth-wise convolution, pointwise convolution, and Inverted Residual Blocks reduce the number of network parameters, thereby pre-trained MobileNetV2 and pre-trained Xception with SVM are much easier to use, show substantial accuracy even with small datasets, and can even be used on mobile devices. For example, the TensorFlow lite open-source API can be used to deploy this article model on an Android smartphone for deep learning, making this method readily applicable to physicians and spine specialists.

5.4 Conclusion

One of the most complex spinal deformities is scoliosis. There are two types of abnormality: C-shaped and S-shaped. Each of these abnormalities requires a different treatment procedure. The vertebrae can sometimes balance themselves and it can be difficult to determine the type of curvature. Often, the bias of the observer and the low quality of the image may lead to an incorrect diagnosis of the type of abnormality. Secondly, there are not enough doctors and radiologists specializing in this field to diagnose and guide patients. The evaluation of abnormalities, however, takes time. Furthermore, this disease is progressive, and the patient may experience more severe stages of the disease over time, requiring braces and complicated surgery. The importance of automated assessment systems can be seen here. In order to evaluate spine abnormalities promptly, it is necessary to use intelligent methods and machine learning algorithms. Since deep networks automatically extract features from medical images, they are useful for the analysis of medical images and the development of spinal deformity evaluation systems.

This study aims to examine the type of spinal deformity through the integration of technology and health care. The automatic system acts as an expert radiologist in diagnosing the type of abnormality and assessing many patients in a short period of time. However, the automatic evaluation system does not display any bias in its evaluation.

This study compares the classical SVM and KNN machine learning algorithms and the pre-trained MobileNetV2 and pre-trained Xception networks to classify anterior–posterior radiology images of C-shaped or S-shaped scoliosis. Sometimes, the type of deformation cannot be initially detected. It is essential to identify the type of curvature to develop systems for assessing spinal deformities. This article aims to provide an overview of the classification of the types of spinal curvature due to the importance of the topic and the lack of research in this field.

Various image representations are used in this article to extract features from images. Generally, image representations can be categorized into two groups: low-level image representations and local patch-based image representations. This study uses texture features as a low-level representation of the image. A BoW represents an image using the local patch. In various experiments, these features have been employed to classify X-ray images of spinal deformities.

Due to automatic feature extraction, the pre-trained Xception and pre-trained MobileNetV2 network with SVM as the final activation function perform significantly better than the classical methods. Furthermore, the results indicate that [(texture and shape) + pixel information + LBP] and BoW-based classification methods of feature extraction perform better than other models produced with other representation techniques. Finally, the SVM algorithm is more powerful and has greater accuracy than KNN.

A limitation of this article is the small dataset of anterior–posterior spine images and the lack of data. Deep networks require large datasets with correct labeling to be effective. As a solution to this problem, data augmentation and transfer learning are employed in this article. In future research, a larger dataset could be utilized, or transfer learning and data augmentation could be used.

Additionally, since X-rays negatively affect a patient's quality of life, the evaluation of the patient should be conducted with the minimum amount of radiation to diagnose the type and severity of the abnormality.

One of the potential ethical principles when using radiographic images is to avoid using personal information about patients. During the collection of the dataset, only radiographic images without any personal information were collected, and the machine learning algorithms and deep networks of the images, along with their curvature type labels, are used as the ground truth.

As a practical tool, this evaluation system can be used to help the doctor make a clinical diagnosis and to assist him in making a final decision. To fully evaluate the abnormalities in the spine structure automatically in future research, it is suggested that segmenting the vertebrae, measuring the Cobb angle, and detecting the type of curvature should be incorporated into the spinal automatic evaluation system with significant accuracy. In the end, the doctor makes the final decision for the patient based on his diagnosis and the information provided by the automated evaluation system.

It is noteworthy that since radiation can affect the quality of life of patients, and patients have to be exposed to radiation regularly during frequent visits, this evaluation system is highly valuable since it can determine the type of curvature of the spine, the segmentation of the vertebrae, the degree of curvature of the spine and calculate Cobb's angle based on the least radiation required.

As a last point, to advance the research and development of the proposed method, a larger dataset can be used to detect C-shaped curvature of the spine and S-shaped curvature of the spine to improve the classification accuracy of the spinal curvature type. In addition to these neural networks, other networks can be used to classify spine curvatures. It is also possible to improve the performance of networks with activation functions, which are crucial to the prevention of vanishing gradient.