Introduction

The paraspinal muscles play an important role in trunk movement and spinal stability. Several studies [2, 3, 6, 8] have demonstrated an association between imaging parameters of the paraspinal muscles, such as cross-sectional area (CSA), shape, density, and volume, and spinal degeneration and low back pain (LBP). In clinical practice, however, these parameters are usually measured on a single 2D axial CT slice, whose level may be chosen differently from hospital to hospital, which limits their reliability. Although measuring the paraspinal muscles in 3D holds the potential to improve accuracy, it has not become common practice because it requires expertise- and time-intensive manual segmentation. The integration of more automated procedures for reliable 3D segmentation of the paraspinal muscles may reduce the labor-intensiveness associated with manual methods and improve the reliability and reproducibility of the acquired imaging parameters with respect to segmentation bias and temporal drift, especially for multicenter, longitudinal studies.

Figure 1 (left) shows the entirety of the paraspinal muscles, which run along almost the entire spine; the muscles form a pair, one on each side of the body. Automatic 3D segmentation of the paraspinal muscles from CT images is challenging due to the size of the data, the large variability of muscle shape and appearance, and the close contact between the paraspinal muscles and surrounding muscles of almost identical intensity, as shown in Fig. 1 (right).

Fig. 1
figure 1

Left: the paraspinal muscles (red) and bone (gray) in the 3D image. The paraspinal muscles are quite large and run along almost the complete spine. Right: the paraspinal muscles seen in an axial slice (top) and the expert manual segmentation (red contours in the bottom image)

Although significant progress has been made in automatic segmentation of muscles from MR images [5, 11, 13, 15, 17,18,19, 22, 28, 30], only a few methods have been introduced to address the problem of automatic segmentation of muscles from CT data [9, 10, 12, 20, 29]. The published CT muscle segmentation methods can be classified into two categories: 2D methods and 3D methods. The methods in the former category usually work on 2D cross-sectional images taken at specific skeletal landmarks instead of on 3D scans. For example, Wei et al. [29] presented a 2D atlas-based method for segmenting paraspinal muscles from 2D axial CT images. Another 2D method was introduced in [20], where a finite element method (FEM)-based deformable model incorporating a priori shape information via a statistical deformation model (SDM) was developed within a template-based segmentation framework for automatic segmentation of skeletal muscle. Recently, Kume et al. [12] investigated deep convolutional neural network (CNN)-based approaches for automatic segmentation of paraspinal muscles at the level of the twelfth thoracic vertebra in torso CT images; an average Dice coefficient of 86.3% was reported. In contrast, the methods in the latter category work directly on 3D scans. Along this line, Kamiya et al. proposed a rule-based expert system for the segmentation of the psoas major [9] and rectus abdominis [10] muscles from CT images, where the shape of the muscles was approximated by a simple quadratic function; an average Jaccard Similarity Coefficient (JSC) of 0.841 was reported in [10]. Inoue et al. [7] introduced a method to segment the psoas major muscle using a higher-order shape prior and reported an average JSC of 76.5%.

In this paper, we propose a novel learning-based method to address the challenging problem of fully automatic segmentation of paraspinal muscles from 3D torso CT images. In comparison with previous work, our contributions are as follows:

  • To speed up the segmentation and to improve accuracy, we propose a novel multi-scale iterative random forest (RF) classification method for fully automatic segmentation of paraspinal muscles from CT images.

  • Inspired by the auto-context model [21, 25], we propose to employ features derived from multi-source information, including the original torso CT images and, in later iterations, the iteratively estimated and refined probability maps of the paraspinal muscles.

  • We conduct experiments to evaluate the performance of the present method and to compare the accuracy of the present method with a deep learning-based method.

The paper is organized as follows. In the next section, we describe the method. The “Experimental design and results” section presents the experimental results, followed by discussions and conclusions in the “Discussions and conclusions” section.

Fig. 2
figure 2

A schematic illustration of how the training procedure works. The appearance features extracted from down-sampled CT images are used to train “Classifier 1,” and then, both appearance features and the context features from probability maps are used to train the subsequent classifiers. We also employed a multi-scale strategy to speed up the training in high resolution

Materials and method

We formulate the segmentation of paraspinal muscles as a two-class classification problem. To solve such a classification problem, we propose to employ random forests [1] and auto-context model [25], and conduct the classification in multiple scales.

Multi-scale random forest classification with auto-context model

Our method is inspired by Qian et al. [21] and Tu and Bai [25]. It is a supervised learning method consisting of training and testing stages. In the training stage, we train a sequence of classification forests, as shown in Fig. 2. In the first iteration, we extract only appearance features from the CT images to train a classification forest (“Classifier 1” in Fig. 2). Applying the forest trained in the first iteration to each training subject produces tissue probability maps for the paraspinal muscles and the background, respectively. In the subsequent iterations, the tissue probability maps obtained from the previous iteration are used as an additional source of information for training, yielding a subsequent classification forest (e.g., “Classifier 2” in Fig. 2). It was demonstrated in [21] that such context features encode spatial constraints into the classification and thus improve the quality of the estimated tissue probability maps.
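
For concreteness, a minimal sketch of this auto-context training loop is given below, assuming scikit-learn random forests; the Haar-like feature extraction of the paper is reduced to placeholder stubs, and the forest size and helper names are arbitrary assumptions rather than details taken from the paper.

```python
# Minimal auto-context training sketch (assumed implementation, not the
# authors' code); feature extraction is reduced to stubs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def appearance_features(volume, points):
    """Stub for the random Haar-like appearance features of Eq. (1)."""
    return volume[tuple(points.T)][:, None]

def context_features(prob_map, points):
    """Stub for Haar-like context features computed on a probability map."""
    return prob_map[tuple(points.T)][:, None]

def estimate_prob_map(clf, volume, prob_map, use_context):
    """Apply a trained forest voxel-wise to produce a muscle probability map."""
    points = np.argwhere(np.ones(volume.shape, dtype=bool))  # all voxels (slow; sketch only)
    feats = appearance_features(volume, points)
    if use_context:
        feats = np.hstack([feats, context_features(prob_map, points)])
    out = np.zeros(volume.shape)
    out[tuple(points.T)] = clf.predict_proba(feats)[:, 1]
    return out

def train_auto_context(volumes, labels, sample_points, n_iters=2):
    """volumes, labels: lists of 3D arrays; sample_points: lists of (N, 3) index arrays."""
    prob_maps = [np.zeros(v.shape) for v in volumes]
    classifiers = []
    for it in range(n_iters):
        X, y = [], []
        for vol, lab, prob, pts in zip(volumes, labels, prob_maps, sample_points):
            feats = appearance_features(vol, pts)
            if it > 0:                       # later iterations add context features
                feats = np.hstack([feats, context_features(prob, pts)])
            X.append(feats)
            y.append(lab[tuple(pts.T)])
        clf = RandomForestClassifier(n_estimators=50)   # forest size is an assumption
        clf.fit(np.vstack(X), np.hstack(y))
        classifiers.append(clf)
        # re-estimate the probability maps with the newly trained forest
        prob_maps = [estimate_prob_map(clf, v, p, use_context=(it > 0))
                     for v, p in zip(volumes, prob_maps)]
    return classifiers
```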

Similarly, in the testing stage, given a target CT image, we can obtain the initial tissue probability maps by applying “Classifier 1” using only the appearance features, as shown in Fig. 3. In the subsequent iterations, along with the appearance features, the tissue probability maps obtained from the previous iteration are also fed into the subsequent classifier for refinement.

Fig. 3
figure 3

A schematic illustration of how the testing procedure works. We used “Classifier 1” to obtain the tissue probability maps of the down-sampled test image. Then, in the later iterations, the tissue probability maps obtained from the previous iteration are also fed into the next classifier for refinement. A multi-scale strategy is used to speed up the testing

In theory, we could apply the RF classification method directly to obtain the 3D segmentation. In practice, however, due to the large size of the torso CT data (ranging from \(512 \times 512 \times 802\) voxels to \(512 \times 512 \times 1031\) voxels), directly applying RF classification leads to long training and testing times. In this paper, we propose a multi-scale strategy to address this issue, conducting both training and testing at multiple scales. More specifically, during training, we first train two classifiers (“Classifier 1” and “Classifier 2” in Fig. 2) following the above procedure on down-sampled training images. For the high-resolution training images, instead of training a classifier on appearance features extracted from high-resolution data to obtain the initial tissue probability maps, we up-sample the probability maps obtained from the classifiers in low resolution; we empirically found that the up-sampled probability maps from “Classifier 1” led to more accurate segmentation results. Furthermore, for each training dataset, we extract a region of interest around the paraspinal muscles by dilating the associated ground-truth segmentation and randomly sample training data points only from this region in order to train a classifier in high resolution (“Classifier 3” in Fig. 2). Similarly, during testing, we up-sample the probability maps obtained from “Classifier 1” to provide initial tissue probability maps in high resolution. We then up-sample and dilate (in this study, by 10 voxels along each axis) the binary segmentation obtained from the probability maps of “Classifier 2” by thresholding and morphological operations to provide a mask, which constrains the test region for “Classifier 3,” i.e., we apply “Classifier 3” only to the voxels inside the masked region to compute the tissue probability maps in high resolution. Thresholding, followed by morphological operations to remove isolated small volumes and internal holes, is then used to obtain the binary segmentation from the probability maps of “Classifier 3,” which is taken as the segmentation output of the present method.
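
As an illustration of the masking step at the high-resolution stage, the following sketch (NumPy/SciPy assumed; the threshold value and helper names are illustrative, not taken from the paper) builds the constrained test region for “Classifier 3” from the low-resolution output of “Classifier 2”:

```python
# Sketch of the high-resolution masking step; library choice, threshold and
# helper names are assumptions.
import numpy as np
from scipy import ndimage

def high_resolution_mask(prob_map_lowres, zoom_factor=4, threshold=0.5, dilation=10):
    """Build the test mask for "Classifier 3" from the low-resolution
    probability map produced by "Classifier 2"."""
    # up-sample the low-resolution probability map to the original grid
    prob_highres = ndimage.zoom(prob_map_lowres, zoom=zoom_factor, order=1)
    # threshold, keep the largest connected component, fill internal holes
    binary = prob_highres > threshold
    labeled, n = ndimage.label(binary)
    if n > 0:
        sizes = ndimage.sum(binary, labeled, index=range(1, n + 1))
        binary = labeled == (np.argmax(sizes) + 1)
    binary = ndimage.binary_fill_holes(binary)
    # dilate by ~10 voxels along each axis to obtain the constrained test region
    structure = np.ones((3, 3, 3), dtype=bool)
    return ndimage.binary_dilation(binary, structure=structure, iterations=dilation)
```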

Appearance features and context features

Considering the size of the data, we use the random Haar-like features as introduced in [27] for both appearance features and context features. Specifically, as shown in Fig. 4, for each voxel x, its Haar-like features are computed as the local mean intensity of any randomly displaced cubical region \({R_1}\) or as the mean intensity difference over any two randomly displaced cubical regions (\({R_1}\) and \({R_2}\)) within the cubic image patch R around the voxel x in a source image I.

$$\begin{aligned} f(x,I) = \frac{1}{{|{R_1}|}}\sum \limits _{p \in {R_1}} {I(p)} - b\,\frac{1}{{|{R_2}|}}\sum \limits _{q \in {R_2}} {I(q)} ,\quad b \in \{ 0,1\} \end{aligned}$$
(1)

where R is the patch centered at voxel x, I is any kind of source image, and the binary parameter \(b \in \{ 0,1\} \) indicates whether one or two cubical regions are used (as shown in Fig. 4, for \(b=0\) and \(b=1\)).

To accelerate the feature extraction within each cubical region, we use the well-known integral image technique as introduced in [26]. Details about how to compute the integral image of a quantity can be found in [26]. The quantity can be the voxel intensity value or the estimated tissue probability value. The advantage of using an integral image lies in the fact that, once the integral image of the quantity over the complete CT volume is obtained, the sum of the quantity in any sub-volume or cubical region can be calculated in constant time, regardless of the size of the cubical region [26].
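
As a worked example of this trick, the following NumPy sketch (an assumed implementation; the paper only states that the integral image technique of [26] is used) precomputes a 3D integral image and evaluates a Haar-like feature of Eq. (1) with constant-time box sums:

```python
# Illustrative NumPy sketch of the 3D integral image and constant-time box
# sums; implementation details are assumptions.
import numpy as np

def integral_image(volume):
    """Cumulative sums along each axis, padded with a zero border so that
    any box sum can be read off with eight lookups."""
    ii = volume.astype(np.float64)
    for axis in range(3):
        ii = np.cumsum(ii, axis=axis)
    return np.pad(ii, ((1, 0), (1, 0), (1, 0)))

def box_sum(ii, lo, hi):
    """Sum of the original volume over the cube [lo, hi) in constant time."""
    (x0, y0, z0), (x1, y1, z1) = lo, hi
    return (  ii[x1, y1, z1] - ii[x0, y1, z1] - ii[x1, y0, z1] - ii[x1, y1, z0]
            + ii[x0, y0, z1] + ii[x0, y1, z0] + ii[x1, y0, z0] - ii[x0, y0, z0])

def haar_feature(ii, r1_lo, r1_hi, r2_lo=None, r2_hi=None):
    """Mean intensity of region R1, minus the mean of R2 when b = 1 (Eq. 1)."""
    f = box_sum(ii, r1_lo, r1_hi) / np.prod(np.subtract(r1_hi, r1_lo))
    if r2_lo is not None:                      # b = 1: two-region feature
        f -= box_sum(ii, r2_lo, r2_hi) / np.prod(np.subtract(r2_hi, r2_lo))
    return f
```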

Fig. 4
figure 4

A schematic illustration of how the Haar-like features as defined by Eq. (1) are computed for two different situations: when \(b=0\) (left) and when \(b=1\) (right)

Table 1 Demographic data of 20 subjects included in our study

Data description

After local institutional review board (IRB) approval, the present method was evaluated on torso CT data with associated manual segmentations of 20 subjects. The CT images used in this study are non-contrast torso CT images acquired with a Light Speed Ultra 16 scanner (manufactured by GE) at Gifu University Hospital. We randomly partitioned the 20 subjects into two evenly sized groups and took one group as the training data and the other as the test data. Table 1 shows the demographic data of all 20 subjects used in our study.

All the CT data have an isotropic voxel resolution of 0.625 mm. The manual segmentation for each dataset was created by Mr. Masanori Kume using a graph cut-based interactive method implemented in the common software platform “PLUTO” (http://pluto.newves.org/trac) [16]. The obtained segmentations were then verified and corrected slice by slice by an anatomical specialist.

Implementation details

We trained and tested the random forest classifiers at two different scales. To train “Classifier 1” and “Classifier 2” in low resolution, we first down-sampled each training dataset to one fourth of its original resolution along each axis. During training, we always sampled evenly distributed data points from each training dataset, i.e., half of the data points from the paraspinal muscle region and the other half from the background. Specifically, to train “Classifier 1,” we randomly sampled 20,000 points from each training dataset and computed 10,000 Haar-like features for each data point; the size of R was chosen to be 25 voxels. To train “Classifier 2,” we again randomly sampled 20,000 data points from each training dataset and computed 10,000 multi-source Haar-like features for each data point, 5000 from the appearance and the other 5000 from the initial probability maps obtained from “Classifier 1”; the size of R was chosen to be 45 voxels. “Classifier 3” was trained on data in the original resolution. For each training dataset, we constrained the region from which the data points were sampled to a ROI computed from the ground-truth segmentation. Again, we sampled 20,000 evenly distributed data points and computed 10,000 multi-source features for each of them, 5000 from the training data and the other 5000 from the up-sampled probability maps, as shown in Fig. 2. The size of R for computing the Haar-like features in high resolution was chosen to be 180 voxels.
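
A minimal sketch of the balanced voxel sampling described above is given below; the library choice and helper names are assumptions, while the sampling numbers follow the text:

```python
# Sketch of balanced sampling of training voxels; parameters follow the text,
# library choice and helper names are assumptions.
import numpy as np
from scipy import ndimage

def sample_training_points(ground_truth, n_points=20000, roi_dilation=None, seed=0):
    """Return voxel coordinates, half from the muscle and half from the
    background (optionally restricted to a dilated ROI, as for "Classifier 3")."""
    rng = np.random.default_rng(seed)
    muscle = ground_truth > 0
    background = ~muscle
    if roi_dilation is not None:
        roi = ndimage.binary_dilation(muscle, iterations=roi_dilation)
        background &= roi                    # sample background only inside the ROI
    pos = np.argwhere(muscle)
    neg = np.argwhere(background)
    pos = pos[rng.choice(len(pos), n_points // 2, replace=False)]
    neg = neg[rng.choice(len(neg), n_points // 2, replace=False)]
    return np.vstack([pos, neg])
```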

Evaluation metrics

Assuming the automatically segmented set of voxels as AS and the manually defined ground truth as GT, we used both volume overlap metrics and distance-based metrics to evaluate the present method.

Volume overlap metrics

We computed the following volume overlap metrics (a computational sketch is given after the list):

  • Dice Coefficients (DC) It quantifies the match of two sets by normalizing the size of their intersection over the average of their sizes and is defined as follows:

    $$\begin{aligned} \hbox {DC} = \frac{{2|\hbox {AS} \cap \hbox {GT}|}}{{|\hbox {AS}| + |\hbox {GT}|}} \end{aligned}$$
    (2)

    where the operator \(| \cdot |\) returns the number of voxels contained in a region.

  • Jaccard Similarity Coefficients (JSC) It is defined as the number of common voxels of the automatically segmented and ground-truth regions over their union:

    $$\begin{aligned} \hbox {JSC} = \frac{{|\hbox {AS} \cap \hbox {GT}|}}{{|\hbox {AS} \cup \hbox {GT}|}} \end{aligned}$$
    (3)
  • Precision (PR) It is defined as the fraction of all automatically segmented voxels that are correct:

    $$\begin{aligned} \hbox {PR} = \frac{{|\hbox {AS} \cap \hbox {GT}|}}{{|\hbox {AS}|}} \end{aligned}$$
    (4)
  • Recall (RC) It is defined as the fraction of all ground-truth voxels that have been correctly segmented by an automatic method:

    $$\begin{aligned} \hbox {RC} = \frac{{|\hbox {AS} \cap \hbox {GT}|}}{{|\hbox {GT}|}} \end{aligned}$$
    (5)
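
The four overlap metrics above can be computed directly from boolean segmentation masks, as in the following sketch (NumPy assumed):

```python
# Overlap metrics of Eqs. (2)-(5) computed from boolean masks (NumPy sketch).
import numpy as np

def overlap_metrics(AS, GT):
    """AS, GT: boolean arrays of the automatic and ground-truth segmentations."""
    inter = np.logical_and(AS, GT).sum()
    union = np.logical_or(AS, GT).sum()
    return {
        "DC":  2.0 * inter / (AS.sum() + GT.sum()),   # Dice coefficient
        "JSC": inter / union,                         # Jaccard similarity coefficient
        "PR":  inter / AS.sum(),                      # precision
        "RC":  inter / GT.sum(),                      # recall
    }
```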

Distance-based metrics

Before we present the definitions of different distance-based metrics, we first define a distance measure for a voxel x from a set of voxels A as:

$$\begin{aligned} d(x,A) = \mathop {\min }\limits _{y \in A} d(x,y) \end{aligned}$$
(6)

where d(x, y) is the Euclidean distance between the voxels, incorporating the real spatial resolution of the volume data.

We further define the directed Hausdorff measure from a point set A to a point set B as the maximum distance, for all points in A, to the closest point in B. Mathematically, this is given as:

$$\begin{aligned} {\overrightarrow{d} _H}(A,B) = \mathop {\max }\limits _{x \in A} (\mathop {\min }\limits _{y \in B} (d(x,y))) \end{aligned}$$
(7)

The directed percent Hausdorff measure, for a percentile r, is the \(r^{th}\) percentile distance over all distances from points in A to their closest point in B. For example, the directed 95% Hausdorff distance is the 95th percentile of the distances from points in A to their closest point in B. Mathematically, denoting the \(r^{th}\) percentile by \({K_r}\), this is given as:

$$\begin{aligned} {\overrightarrow{d} _{H,r}}(A,B) = {K_r}(\mathop {\min }\limits _{y \in B} d(x,y)),\forall x \in A \end{aligned}$$
(8)

With these definitions, we can define a number of distance-based metrics to quantify the dissimilarity of the automatic segmentation from the ground truth (a computational sketch is given after the list):

  • Average Surface Distance (ASD) It is defined as the average of all the distances from points on the boundary of AS (we denote them as \({B_\mathrm{{AS}}}\)) to the boundary of GT (\({B_\mathrm{{GT}}}\)):

    $$\begin{aligned} \hbox {ASD} = \frac{1}{{|{B_\mathrm{{AS}}}|}}\sum \limits _{x \in {B_\mathrm{{AS}}}} {d(x,{B_\mathrm{{GT}}})} \end{aligned}$$
    (9)
  • Average Symmetric Surface Distance (ASSD) It is defined as the average of all the distances from points on the boundary \({B_\mathrm{{AS}}}\) to the boundary \({B_\mathrm{{GT}}}\) and from points on \({B_\mathrm{{GT}}}\) to \({B_\mathrm{{AS}}}\):

    $$\begin{aligned} \hbox {ASSD}= & {} \frac{1}{{|{B_\mathrm{{AS}}}| + |{B_\mathrm{{GT}}}|}} \nonumber \\&\quad \times \left( \sum \limits _{x \in {B_\mathrm{{AS}}}} {d(x,{B_\mathrm{{GT}}})} + \sum \limits _{y \in {B_\mathrm{{GT}}}} {d(y,{B_\mathrm{{AS}}})} \right) \nonumber \\ \end{aligned}$$
    (10)
  • Modified Hausdorff Distance (MHD) It is defined as the undirected 95th percentile Hausdorff measure [4]:

    $$\begin{aligned} \hbox {MHD} = \frac{{{{\overrightarrow{d} }_{H,95}}(\hbox {AS,GT}) + {{\overrightarrow{d} }_{H,95}}(\hbox {GT,AS})}}{2} \end{aligned}$$
    (11)
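
A computational sketch of the three distance-based metrics is given below (SciPy assumed; taking the object surfaces as the voxels removed by a one-voxel erosion is a convention chosen here for illustration):

```python
# Sketch of the distance-based metrics (Eqs. 9-11); library choice and the
# surface-extraction convention are assumptions.
import numpy as np
from scipy import ndimage

def surface_voxels(mask):
    """Boundary voxels: those removed by a one-voxel binary erosion."""
    return mask & ~ndimage.binary_erosion(mask)

def surface_distances(A, B, spacing):
    """Distances from every surface voxel of A to the closest surface voxel of B."""
    dt_B = ndimage.distance_transform_edt(~surface_voxels(B), sampling=spacing)
    return dt_B[surface_voxels(A)]

def distance_metrics(AS, GT, spacing=(0.625, 0.625, 0.625)):
    d_as = surface_distances(AS, GT, spacing)   # AS -> GT
    d_gt = surface_distances(GT, AS, spacing)   # GT -> AS
    return {
        "ASD":  d_as.mean(),
        "ASSD": (d_as.sum() + d_gt.sum()) / (len(d_as) + len(d_gt)),
        "MHD":  0.5 * (np.percentile(d_as, 95) + np.percentile(d_gt, 95)),
    }
```
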
Table 2 Segmentation results of the 10 test torso CT data
Fig. 5
figure 5

Segmentation of the best (top, test case 05) and the worst (bottom, test case 03) cases. From left to right, the input image, the probability map from “Classifier 1,” the probability map from “Classifier 2,” the probability map from “Classifier 3,” the final segmentation result, and the ground truth segmentation

Experimental design and results

Experimental design

We conducted two different studies in order to evaluate the efficacy of the present method. For the first study, the segmentation result of each test dataset obtained by the present method was compared with the associated manual segmentation. For the second study, due to the large size of the input data, we implemented a 2D fully convolutional network (FCN) [14] based on the network structure of VGG 16 [24]. In this FCN, the fully connected layer in VGG 16 is replaced by a convolutional layer, which is then followed by a \(1 \times 1\) convolutional layer to generate a segmentation at a down-sampled resolution. To obtain the segmentation at full resolution, up-sampling is done via deconvolutions [23].
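
Since the paper gives no further implementation details of this baseline, the following PyTorch sketch is only an assumed, minimal FCN-32s-style rendering of the described architecture (VGG 16 convolutional backbone, a convolutional layer replacing the fully connected layers, a \(1 \times 1\) scoring layer, and a deconvolution for up-sampling):

```python
# Assumed minimal 2D FCN sketch based on VGG 16; not the authors' implementation.
import torch.nn as nn
from torchvision.models import vgg16

class SimpleFCN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = vgg16(weights=None).features          # convolutional layers only
        self.head = nn.Sequential(
            nn.Conv2d(512, 4096, kernel_size=7, padding=3),    # replaces the FC layers
            nn.ReLU(inplace=True),
            nn.Conv2d(4096, num_classes, kernel_size=1),       # 1x1 scoring layer
        )
        # 32x up-sampling back to the input resolution via a transposed convolution
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=64, stride=32, padding=16)

    def forward(self, x):
        h = self.backbone(x)      # 1/32 of the input resolution (input divisible by 32)
        h = self.head(h)
        return self.upsample(h)   # per-pixel class scores at full resolution
```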

Results

Quantitative segmentation results of the 10 test data are shown in Table 2. Our approach achieved a mean DC of \(93.0\pm 2.1\%\), a mean JSC of \(87.0 \pm 3.5\%\), a mean RC of \(96.4 \pm 1.5\%\), a mean PR of \(89.9 \pm 3.6\%\), a mean ASD of \(0.79 \pm 0.20\) mm, a mean ASSD of \(0.85 \pm 0.19\) mm, and a mean MHD of \(2.85\pm 0.56\) mm. Figure 5 shows the segmentation procedure for the best case (top row) and the worst case (bottom row). Qualitatively, it can be seen that without incorporating context features, the probability maps from “Classifier 1” (second column) show high values over relatively large false positive regions. After integrating context features, the false positive regions are reduced, as reflected by the probability maps from “Classifier 2” (third column), but not completely removed. By incorporating the up-sampled context features with the constrained region of interest in the high-resolution image space, “Classifier 3” generates probability maps (fourth column) with significantly reduced false positive regions, demonstrating the efficacy of the present method.

Implemented on a machine with a 3.5GHz Intel(R) i7 CPU with 12 cores and 64 GB RAM, it took on average 46.5 s to segment a torso CT image with the size ranging from \(512 \times 512 \times 802\) voxels to \(512 \times 512 \times 1031\) voxels. In contrast, without using the proposed multi-scale strategy, we have to test each voxel in a given 3D scan, which leads to an average test time of 205.0 s.

The results of the second study are shown in Table 3. In comparison with the 2D FCN method, our method demonstrated better performance. More specifically, the 2D FCN method achieved a mean DC of \(89.9 \pm 2.0\%\), a mean JSC of \(81.7\pm 3.2\%\), a mean RC of \(92.8\pm 5.0\%\) and a mean PR of \(87.5\pm 4.3\%\). In contrast, our method achieved a mean DC of \(93.0\pm 2.1\%\), a mean JSC of \(87.0\pm 3.5\%\), a mean RC of \(96.4\pm 1.5\%\), and a mean PR of \(89.9\pm 3.6\%\).

Table 3 Comparison of the results obtained by a 2D FCN and our method

Discussions and conclusions

Manual and automated segmentation of individual muscles in CT images has been recognized as a challenging task, given the high variability of shapes between muscles and subjects and the discontinuity or lack of visible boundaries between the target muscles and surrounding muscles. In this paper, we proposed a novel learning-based method for automatic segmentation of paraspinal muscles from 3D torso CT images and conducted a validation study to confirm the efficacy of the proposed method.

The results achieved by our method are better than those reported in previous work. For example, based on deep learning techniques, Kume et al. [12] reported a mean DC of 86.3%, while our method achieved a mean DC of 93.0%. Using a higher-order shape prior, Inoue et al. [7] reported an average JSC of 76.5% in segmenting the psoas major muscle, which is lower than what our method achieved. The better accuracy of our method is probably due to the integration of multi-source information in a multi-scale learning-based framework. As shown in Fig. 5, the integration of multi-source information and the adoption of the multi-scale strategy progressively refine the probability maps obtained at different stages, leading to an accurate segmentation at the final stage. To obtain a fair comparison, we also implemented a 2D FCN method; our experimental results showed that the results achieved by our method were better than those achieved by the 2D FCN method.

The present method is not only accurate but also fast, largely due to the proposed multi-scale strategy. For random forest classification, the test time is proportional to the number of voxels to be classified in the test data. The initial segmentation obtained from “Classifier 2” at low resolution allows us to define a mask that constrains the test at high resolution to a smaller region of interest. This not only improves the learning efficacy, as we concentrate on a smaller region than the complete image space, but also leads to a faster algorithm, as fewer voxels need to be tested. Our experimental results demonstrate that our algorithm is about four times faster than the variant without the multi-scale strategy.

It is worth comparing the method introduced in [21] with the present method. First, both methods are based on random forest classification with the auto-context model [25]. Second, both studies confirm the effectiveness of incorporating context features for refined segmentation, despite the fact that the method introduced in [21] is applied to multi-parametric prostate MR images while the present method is evaluated on torso CT data. The differences between the two methods, however, are also apparent. More specifically, because the study reported in [21] aims to localize prostate cancer in in vivo MR images, the resolution of their data is relatively low, leading to a small data dimension along the out-of-plane direction. For example, the highest resolution of the multi-parametric MR images used in [21] is \(0.3125 \times 0.3125 \times 3\,\hbox {mm}^3\). Additionally, their data were cropped around the prostate, which is a relatively small organ, in order to localize the prostate cancer in the cropped MR images. This is why they can repeatedly apply random forest classification with the auto-context model in the original data space to obtain refined results. In contrast, the resolution of our data is high along all three axes, leading to large data dimensions. Additionally, as shown in Fig. 1, the paraspinal muscles are quite large, running along almost the complete spine. Furthermore, we did not purposely crop our torso CT data around the paraspinal muscles, which complicates the learning task for our problem. This is demonstrated in the second and third columns of Fig. 5, where false positive predictions appear above and below the paraspinal muscles. By combining information extracted from the outputs of the two classifiers trained in low resolution, we focus the third classifier on learning important multi-source features in a constrained region instead of the whole volume. As demonstrated in the fourth column of Fig. 5, this strategy significantly reduces the false positive predictions, leading to a refined segmentation.

There are limitations to our study. First, the dataset used in our study is relatively small. We plan to enlarge the dataset to include torso CT data of over 50 subjects, but the main challenge is obtaining the ground-truth annotations. Second, all the CT data used in this study were acquired with the same scanner at Gifu University Hospital. It would be interesting to apply our trained model to CT images from other scanners in order to test inter-scanner robustness. Considering that, unlike MR image values, CT values are correlated with tissue attenuation coefficients, we hypothesize that we can directly apply our trained model to CT data acquired with other scanners. This hypothesis needs to be verified in our future work. Last but not least, the present method was evaluated on CT data collected with a standard clinical protocol. Whether it works on heterogeneous data acquired in clinical routine needs to be further investigated in the future.

In summary, we proposed a novel learning-based method to address the challenging problem of automatic segmentation of paraspinal muscles from 3D torso CT images. Our method is based on multi-scale iterative random forest classifications with multi-source information. The experimental results demonstrated the efficacy of our proposed approach.