Introduction

Imaging has been an integral component of various surgical and non-surgical orthopedic procedures such as total knee replacement (TKR), intramedullary nail locking for femoral shaft fractures, pedicle screw insertion for spinal fusion surgery, lumbar neuraxial anesthesia, and epidural analgesia [25]. Current practice during these procedures relies on intra-procedure 2D fluoroscopy as the main imaging modality for localization and visualization of bones, fractures, implants, and surgical tool positions. However, with such projection imaging, surgeons and clinicians typically face considerable difficulties in accurately localizing bone fragments in 3D space and assessing the adequacy and accuracy of the procedure. This problem can be overcome with 3D fluoroscopy units; however, these are roughly twice as expensive as standard 2D units and not as widely available. Finally, fluoroscopy involves significant radiation exposure [25], and exposure to ionizing radiation should be kept to a minimum to avoid potential long-term complications. To overcome some of these limitations, 2D/3D US has emerged as a safe alternative that remains relatively inexpensive and widely available [8]. US image data, however, is typically characterized by high levels of speckle noise, reverberation, anisotropy, and signal dropout, which introduce significant difficulties during interpretation of the captured data. A limited field of view and strong operator dependence cause additional difficulties during data collection, since a single degree of deviation in the transducer angle can reduce the signal strength by \(50\%\) [8]. To overcome these difficulties, automatic bone segmentation [8] and registration [21] methods have been developed. Most recently, methods based on deep learning have achieved successful results for segmenting bone surfaces [1, 2, 22, 23]. However, these methods require large amounts of training data, and their accuracy decreases if the quality of the testing data is low or if the testing data comes from a different vendor's machine. In the context of bone imaging using US, high quality data is characterized by a high intensity bone surface response followed by a low intensity region referred to as the shadow region. The difficulty of acquiring high quality US images is an ongoing limitation of current US guided orthopedic procedures.

Fig. 1

Top row: from left to right, in vivo B-mode US images of the distal radius, femur, knee, and spine, respectively. Yellow arrows point to high intensity bone features. Red arrows point to problematic low intensity bone features caused by misalignment of the transducer or the complex shape of the anatomy. Green arrow quads show the shadow region. Bottom row: manually segmented gold standard shadow images corresponding to the B-mode data shown in the top row. In all images the blue color-coded region is the shadow region and the red color-coded region is the soft tissue interface

Acoustic shadows occur at interfaces where there is a high impedance difference, such as air-tissue, tissue-bone, and tissue-lesion. Bone shadow information can aid in the interpretation of the collected data and has been incorporated as an additional feature to improve the segmentation of bone surfaces from US data [3, 5, 8, 23]. Real-time feedback of bone shadow information can also be used to guide the clinician to a standardized diagnostic viewing plane with minimal artifacts. Finally, shadow information can also be used as an additional feature for registering CT, MRI, or statistical shape models (SSM) to US data [21]. However, poor transducer contact or incorrect orientation of the transducer with respect to the imaged anatomy can lead to poor shadow appearance, resulting in misinterpretation of the anatomy and failure of computational methods that use the shadow feature (Fig. 1). Therefore, the enhancement of shadow regions has been investigated and practical solutions have been offered.

Several groups have proposed computational methods to improve the appearance of shadow regions in US data. Karamalis et al. [14] proposed a random-walk geometric technique, based on image intensity, that models the propagation path of a US signal along the scanline. The generated images were termed confidence map (CM) images, and shadow regions were extracted from the CM images by intensity thresholding. This approach was later extended for processing radio-frequency US data [15]. In [10], shadow images of the brain were extracted by entropy analysis along the scanline; pixels with low entropy were selected to form the shadow image. The method was later incorporated into a spinous process segmentation framework [3]. In [11], statistics of B-mode and radio-frequency (RF) US data were investigated and used for shadow detection. Mean Dice similarity coefficients (DSC) of 0.90 and 0.87 were obtained for the RF and B-mode algorithms, respectively; the processing time was not reported. Although promising results were achieved in these earlier works, intensity-based approaches are not robust to typical imaging artifacts and are affected by intensity variations. Changing the US machine acquisition settings, sub-optimal orientation of the transducer with respect to the imaged anatomy, imaging anatomy with complex shape (such as the spine), or scanning patients with different body mass indices results in the collection of low quality US data (Fig. 1) and decreases the success of intensity-based approaches. RF-based shadow detection overcomes some of the difficulties of intensity-based approaches; however, it requires special hardware or software to access the RF signal domain, which is not available in most clinical US machines. To provide an intensity-invariant alternative, methods based on local phase image information have been proposed for the enhancement of bone shadow regions [8]. The method proposed in [8] uses local phase image features as input to an L1 norm-based contextual regularization method that emphasizes uncertainty in the shadow regions. Quantitative analysis, performed on a manually selected region of interest (ROI), achieved a mean DSC of 0.88; the mean computation time of 9.3 s makes the method unsuitable for real-time applications. In [17], a weakly supervised method for acoustic confidence estimation of shadow regions from fetal US data was proposed. In particular, a shadow-seg module was presented to extract generalized shadow features for a large range of shadow types in fetal US images under limited weak manual annotations. Both a classification and a segmentation network with an attention layer mechanism were used. The reported average DSC, recall, and precision were 0.71, 0.72, and 0.73, respectively.

In this paper, we propose a conditional GAN (cGAN)-based method for accurate real-time segmentation of bone shadow regions from in vivo US scans. Our specific contributions include: (1) a novel GAN architecture designed to perform accurate, robust, and real-time segmentation of bone shadow images from in vivo US data; (2) we show how the segmented bone shadow regions can be used as an additional proxy to improve the bone surface segmentation results of a multi-feature guided convolutional neural network (CNN) architecture [1]. The significance of using shadow feature-based segmentation is that the shadow images can be generated in real time, as opposed to local phase image-based methods [1], which take around one second; (3) we evaluate the proposed method on extensive in vivo data obtained from 27 volunteers using two different US imaging systems and provide quantitative evaluation results against state-of-the-art GAN architectures.

Methods

Data acquisition

After obtaining institutional review board (IRB) approval, two imaging devices were used to collect data from 27 healthy subjects. Depth settings and image resolutions varied between 3–8 cm and 0.12–0.19 mm, respectively:

  1. Sonix-Touch US machine (Analogic Corporation, Peabody, MA, USA) with a 2D C5-2/60 curvilinear probe and an L14-5 linear probe. Using this device we collected 1000 scans from 23 subjects.

  2. Clarius C3 hand-held wireless ultrasound probe (Clarius Mobile Health Corporation, BC, Canada). Using this device we collected 235 scans from 4 subjects.

All collected scans were scaled to a standardized size of \(256\times 256\). The bone surfaces were manually segmented by an expert ultrasonographer. Gold standard bone shadow images were constructed automatically by investigating the intensity values below the manually segmented bone surfaces along the scanline direction; the region below the manually segmented bone surface was identified as the shadow region. In total, we had 1235 B-mode US images categorized into four groups of bone structures: radius, femur, spine, and tibia. We performed fivefold cross validation on the Sonix-Touch data, ensuring that no scans from one patient appeared in more than one fold. Training of the network architectures was performed using the Sonix-Touch data only. All 235 scans obtained from the Clarius C3 probe were used as test data.
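As an illustration, the patient-wise split can be realized as in the minimal sketch below. The variable names and the use of scikit-learn's GroupKFold are our own assumptions for illustration, not the original implementation.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Dummy stand-ins: one entry per Sonix-Touch scan and the subject it came from.
n_scans, n_subjects = 1000, 23
rng = np.random.default_rng(0)
scan_ids = np.arange(n_scans)
subject_ids = rng.integers(0, n_subjects, size=n_scans)

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(scan_ids, groups=subject_ids)):
    # All scans of a given subject land in exactly one test fold, so no patient
    # appears in more than one fold, matching the protocol described above.
    assert set(subject_ids[train_idx]).isdisjoint(subject_ids[test_idx])
```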

Network architecture

Our architecture is based on the common GAN layout consisting of two co-existing neural networks: a generator G that attempts to generate synthetic samples and a discriminator D that tries to discriminate between generated synthetic samples and real ones [6]. In this work, we adopt the conditional aspect presented in [12], with both our generator G and discriminator D taking additional information into account. Our proposed cGAN-based bone shadow segmentation and bone surface segmentation network architecture is shown in Fig. 2. The training of our proposed cGAN architecture follows the typical optimization problem [12], in which the discriminator D tries to maximize and the generator G tries to minimize the following objective \(\mathcal {G}\):

$$\begin{aligned} \mathcal {G} = \arg \underset{G}{\mathrm{min}} \; \underset{D}{\mathrm{max}} \;&\mathbb {E}_{\mathrm{BM},\mathrm{GS}} \left[ \log D(\mathrm{BM},\mathrm{GS}) \right] \\&+ \mathbb {E}_{\mathrm{BM},z} \left[ \log \left( 1-D\left( \mathrm{BM},G\left( \mathrm{BM},z \right) \right) \right) \right] \\&+ \lambda \, \mathbb {E}_{\mathrm{BM},\mathrm{GS},z}\left[ \left\| \mathrm{GS}-G\left( \mathrm{BM},z \right) \right\| _{1} \right] \end{aligned}$$

in which GS represents the gold standard bone shadow image, BM represents the B-mode US image, BS represents the segmented bone shadow image, and z represents noise, provided as Gaussian noise in initial training and otherwise applied as dropout on some layers in the convolution blocks. Different from a traditional GAN architecture, our generator G is conditioned on the in vivo B-mode US image, BM, and is additionally tasked with generating BS images that are as close as possible to the GS images through the L1-distance term shown in the equation above. Our generator architecture is based on the common contractive-expansive design, where the encoder maps the input image into a low-dimensional latent space and the decoder maps the latent representation back into the original space. It is trained to generate bone shadow (BS) images. However, unlike [12], where the generator was based on [20], we employ a different structure for the generator. Similar to [1], the input is processed through convolutional blocks, with each block consisting of several convolutional layers (Fig. 3). We incorporated skip connection and projection blocks similar to [16]. Our skip connection blocks, denoted as S, consist of a \(1\times 1\) convolution, a \(3\times 3\) convolution, and another \(1\times 1\) convolution, with each convolution operation followed by batch normalization and leaky rectified linear unit (Leaky ReLU) activation; this sequence reduces and then restores the channel dimensions. In our design, Leaky ReLU was used in both the encoder and the decoder. In [24], it has been shown that Leaky ReLU achieves lower training and test errors compared to ReLU. Furthermore, Leaky ReLU mitigates the 'dying ReLU' (vanishing gradient) problem by maintaining a small non-zero slope, and thus a non-zero gradient, in the negative portion, which helps the network converge faster during training. This informed our choice of Leaky ReLU. The block output is produced by concatenating the block input with the output of the aforementioned convolutions. As for our projection blocks, denoted as P, an additional \(1\times 1\) convolution is applied to the projected input, and the rest is similar to the skip connection blocks. In the decoder, we replace all convolution operations with transposed convolutions and use stride-2 transposed convolutions to upsample the feature maps. These skip and projection blocks with transposed convolutions are therefore denoted as \(S^\prime \) and \(P^\prime \), respectively. Additionally, in the decoder each batch normalization is followed by a dropout layer with a dropout rate of \(50\%\). A code-level sketch of the S and P blocks is given after the architecture summary below. The architecture of the generator can be summarized as:

  • Encoder: S32 S32 P32 - S64 S64 P64 - S128 S128 P128 - S256 S256 P256 - S512 S512 P512

  • Decoder: \(S^\prime 512\)\(S^\prime 512\)\(P^\prime 512\) - \(S^\prime 256\)\(S^\prime 256\)\(P^\prime 256\) - \(S^\prime 128\)\(S^\prime 128\)\(P^\prime 128\) - \(S^\prime 64\)\(S^\prime 64\)\(P^\prime 64\) - \(S^\prime 32\)\(S^\prime 32\)\(P^\prime 32\)
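The following Keras sketch illustrates the skip connection (S) and projection (P) blocks described above. The filter depths \(d_1\) and \(d_2\), the downsampling stride, and the exact way the projected input is combined with the convolutional branch are our own assumptions based on the description and Fig. 3, not the original implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_lrelu(x, filters, kernel_size, strides=1):
    # Convolution followed by batch normalization and Leaky ReLU activation.
    x = layers.Conv2D(filters, kernel_size, strides=strides, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(0.2)(x)

def skip_block(x, d1, d2):
    # Skip connection block S: 1x1 -> 3x3 -> 1x1 convolutions that reduce and
    # restore channel dimensions; the block input is concatenated with the output.
    y = conv_bn_lrelu(x, d1, 1)
    y = conv_bn_lrelu(y, d1, 3)
    y = conv_bn_lrelu(y, d2, 1)
    return layers.Concatenate()([x, y])

def projection_block(x, d1, d2, strides=2):
    # Projection block P: same branch as S, but the input itself is additionally
    # passed through a 1x1 convolution (the "projection") before being combined.
    # The downsampling stride of 2 is an assumption.
    y = conv_bn_lrelu(x, d1, 1, strides)
    y = conv_bn_lrelu(y, d1, 3)
    y = conv_bn_lrelu(y, d2, 1)
    p = conv_bn_lrelu(x, d2, 1, strides)  # projected input
    return layers.Concatenate()([p, y])
```

In the decoder, the \(S^\prime \) and \(P^\prime \) counterparts would replace Conv2D with Conv2DTranspose and add the \(50\%\) dropout after each batch normalization.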

In our discriminator model, a two-input \(N\times N\) PatchGAN-like discriminator [12] was used to classify \(N\times N\) patches of the input image as real or synthetic. Like the aforementioned generator, our discriminator architecture consists of five convolutional blocks, with a final convolution applied to the last layer to map it to a one-dimensional output before a sigmoid function is applied. Each batch normalization was followed by a Leaky ReLU with slope 0.2. An Adam solver with a learning rate of 0.0002 was used, and the structure of the discriminator can be expressed as follows:

  • Discriminator: S32 S32 P32 - S64 S64 P64 - S128 S128 P128 - S256 S256 P256 - S512 S512 P512
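To make the adversarial training of these two networks concrete, the sketch below expresses the objective \(\mathcal {G}\) as generator and discriminator losses. The weight \(\lambda = 100\) is only an assumed placeholder; the value used in our experiments is not implied by this sketch.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)
LAMBDA = 100.0  # assumed placeholder weight for the L1 term

def discriminator_loss(d_real, d_fake):
    # Real (BM, GS) pairs should be scored 1, generated (BM, BS) pairs 0.
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_loss(d_fake, generated_bs, gold_gs):
    # Adversarial term (fool the discriminator) plus the lambda-weighted
    # L1 distance ||GS - G(BM, z)||_1 to the gold standard shadow image.
    adv = bce(tf.ones_like(d_fake), d_fake)
    l1 = tf.reduce_mean(tf.abs(gold_gs - generated_bs))
    return adv + LAMBDA * l1
```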

Fig. 2

Our proposed conditional GAN in which the discriminator learns to classify between real (gold standard shadow images, GS) and fake (generated bone shadow images, BS)

While our proposed cGAN architecture was used to segment bone shadow regions (BS), the dual-input bone surface segmentation network proposed in [1] was used in our model to localize bone structures. The B-mode US image BM and the segmented bone shadow image BS were used as input to this multi-feature CNN architecture. Feature maps extracted from both images are fused in a fusion layer at early (pixel level), mid (feature level), or late (classifier level) stages. Concatenation fusion was used as the fusion operation [1]; it does not define any correspondence, as it simply stacks feature maps at the same spatial locations across the feature channels. The multi-feature CNN architecture was trained separately from our proposed cGAN architecture using a cross-entropy loss.
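As an illustration, concatenation fusion amounts to stacking the two feature maps along the channel axis. A minimal sketch under the assumption of \(256\times 256\) single-channel inputs (tensor names are ours):

```python
from tensorflow.keras import layers

def concat_fusion(bm_maps, shadow_maps):
    # Stacks feature maps at the same spatial locations across the channel
    # dimension; no explicit correspondence between the two streams is defined.
    return layers.Concatenate(axis=-1)([bm_maps, shadow_maps])

# Early fusion (pixel level): concatenate the raw B-mode and shadow inputs.
bm = layers.Input((256, 256, 1))
bs = layers.Input((256, 256, 1))
fused = concat_fusion(bm, bs)  # shape (256, 256, 2), fed to the first conv layer
```

In the mid and late fusion variants, the same operation is applied to deeper feature maps or to the classifier-level outputs of the two streams, respectively.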

Fig. 3

An overview of our proposed cGAN architecture: a the generator's encoder, consisting of ten skip connection blocks (blue) and five projection blocks (yellow); b the generator's decoder, consisting of ten transposed skip connection blocks (orange) and five transposed projection blocks (green). The depths of each convolutional layer are indicated in each block by \(d_1\) and \(d_2\). c Our proposed PatchGAN-like discriminator

Quantitative evaluation

Bone shadow segmentation: The performance of our proposed design was compared against the state-of-the-art GAN networks proposed in [12] and [18]. The depths of these networks were increased to a scale close to that of our proposed design. The bone shadow regions were also segmented from the test data using the local phase image-based bone shadow enhancement method proposed in [7]. To show the effectiveness of the discriminator network, we also obtained bone shadow segmentation results by training only our proposed generator network. Finally, to show the improvements achieved using a cGAN architecture over a traditional CNN architecture, we trained the U-net network proposed in [20] using B-mode US image features and gold standard bone shadow images. Based on [4, 7, 13, 19], four error metrics were calculated on our testing set: Dice, Rand error, Hamming loss, and intersection over union (IoU). The evaluation metrics are computed on the estimated probability maps, with grayscale color maps, and compared to the gold standard bone shadow images.
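For reference, a minimal sketch of how Dice and IoU can be computed from a predicted probability map and the gold standard mask; the 0.5 binarization threshold is our assumption:

```python
import numpy as np

def dice_and_iou(prob_map, gold_mask, threshold=0.5):
    # Binarize the predicted probability map and compare with the gold standard.
    pred = prob_map >= threshold
    gold = gold_mask.astype(bool)
    intersection = np.logical_and(pred, gold).sum()
    union = np.logical_or(pred, gold).sum()
    dice = 2.0 * intersection / (pred.sum() + gold.sum() + 1e-8)
    iou = intersection / (union + 1e-8)
    return dice, iou
```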

Bone surface segmentation: Bone shadow images segmented using our proposed design and the designs in [12] and [18] were used as an additional feature to our multi-feature CNN architecture [1] for bone surface segmentation. Our method uses fusion of feature maps obtained from B-mode US data and bone shadow images. During the evaluation studies, we investigated different fusion architectures: early, mid, and late fusion [1]. We also investigated bone surface segmentation results when gold standard bone shadow images were used as the additional feature. The bone segmentation networks were trained to minimize the cross-entropy loss. We used the Adam optimizer with a batch size of 8 and a learning rate of 0.0002 for 36,000 iterations. In addition to the error metrics explained previously in this section, we also evaluate the average Euclidean distance (AED) error for the task of bone segmentation. AED was calculated between the automatically segmented bone surfaces and the manual expert segmentation [1].
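A minimal sketch of such a surface-distance computation, assuming the two bone surfaces are given as arrays of pixel coordinates and a nearest-neighbour correspondence is used; the correspondence scheme and the pixel-to-millimetre scaling factor are assumptions, not the exact AED definition of [1]:

```python
import numpy as np
from scipy.spatial import cKDTree

def average_euclidean_distance(auto_pts, gold_pts, mm_per_pixel=0.15):
    # Mean distance (in mm) from each automatically segmented bone-surface
    # pixel to its closest manually segmented bone-surface pixel.
    distances, _ = cKDTree(gold_pts).query(auto_pts)
    return distances.mean() * mm_per_pixel
```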

Results

Our experiments were conducted using the Keras framework with TensorFlow as the backend, on an Intel Xeon CPU at 3.00 GHz and an Nvidia Titan-X GPU with 8 GB of memory. Our network converged in about 8 hours of training. Testing took on average 54 milliseconds in total for bone shadow and bone surface segmentation.

Quantitative results

Bone shadow segmentation: Table 1 shows the performance differences among the investigated bone shadow segmentation methods. Overall, our method outperforms previous state-of-the-art GAN architectures and the local phase-based bone shadow enhancement method [7]. The local phase image-based method proposed in [7] achieved the lowest DSC value (0.28). However, we note that in the original work of [7] an ROI covering a bone interface spanning the full width of the image was selected during quantitative evaluation, whereas in our analysis we did not select an ROI and instead used the full B-mode US image. Our generator network, without the discriminator, achieved an average Dice value of 0.67, while adding the discriminator resulted in a \(39\%\) improvement in Dice value. Our proposed cGAN architecture achieves \(8\%\) and \(3\%\) improvements in DSC value over the state-of-the-art GAN architectures proposed in [12] and [18], respectively. A paired t-test on the IoU, DSC, and AED results at a \(5\%\) significance level, between our proposed network and the networks in [12, 18], yielded p values less than 0.05, indicating that the improvements of our method are statistically significant. The improvement over the U-net architecture [20] was \(46\%\) in DSC value. We also observed that our generator network, without the discriminator, outperforms U-net [20] by \(6\%\) in Dice value.
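A minimal sketch of the paired t-test used for this comparison, assuming the per-image metric values of the two methods are available as equal-length arrays (variable names are ours):

```python
from scipy.stats import ttest_rel

def is_significant(ours_per_image, baseline_per_image, alpha=0.05):
    # Paired t-test on per-image metric values (e.g. DSC) of the two methods.
    _, p_value = ttest_rel(ours_per_image, baseline_per_image)
    return p_value < alpha, p_value
```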

Bone surface segmentation: Quantitative results for bone surface segmentation are presented in Table 2. The average numerical error calculations show that the late-fusion design had the lowest errors and the highest average IoU and Dice values (Table 2) (note: in all tables, the results of the method that outperformed the other methods are indicated in bold). A paired t-test on the IoU, DSC, and AED results at a \(5\%\) significance level, between our proposed network and the networks in [12, 18], yielded p values less than 0.05, indicating that the improvements of our method are statistically significant. There was no statistically significant difference between using gold standard bone shadow images and the bone shadow images generated using the proposed design in the late fusion setting. When using local phase image features as the additional feature for our multi-feature CNN architecture [1], the AED error was 0.30 mm, compared to 0.11 mm when using the generated bone shadow images.

Table 1 Bone shadow segmentation error metrics
Table 2 Bone segmentation error metrics

Qualitative results

Fig. 4

Qualitative results for bone shadow segmentation. a In vivo B-mode US images of femur, tibia, radius, knee, and spine. b Gold standard bone shadow images. c Bone shadow results obtained using the local phase-based ultrasound transmission maps method presented in [7]. d Bone shadow results obtained using Ronneberger et al. [20]. e Bone shadow results obtained using Radford et al. [18]. f Bone shadow results obtained using Isola et al. [12]. g Bone shadow results obtained using our proposed cGAN

Qualitative results of our proposed model are shown in Fig. 4. We show five examples of in vivo B-mode US images of different bone types, namely femur, tibia, radius, knee, and spine, where red pixels indicate high prediction scores and blue pixels indicate low prediction scores. The gold standard bone shadow images obtained by an expert are displayed, followed by the bone shadow results obtained using the convolutional network presented by Ronneberger et al. [20], the generative networks in [12, 18], and our proposed model, as shown in Fig. 4d through g. In Fig. 4c, we show the shadow results obtained using the local phase-based ultrasound transmission maps method presented in [7]. Investigating the qualitative results, we can conclude that our proposed method segments bone shadow images with minimal artifacts.

Discussion and conclusions

A method, based on a novel GAN, for real-time and accurate segmentation of bone shadow regions from in vivo US scans was proposed. Our model has two main networks: (1) a cGAN to generate bone shadow images and (2) a segmentation network that takes the generated bone shadow data in conjunction with B-mode US data for localization of bone surfaces. An integral component of our generator and discriminator design was the skip and projection blocks. To the best of our knowledge, this has not been previously investigated in the community, and this is the first work proposing a novel cGAN architecture for the task of bone shadow segmentation. The projection blocks allow semantic information to be passed forward in the network more efficiently while progressively increasing feature map sizes, compared to the simple convolutions used in many designs, including [20]. By implementing these projection blocks, we obtain more comprehensive feature maps that improve the bone shadow generation. We have also extended the depth of the discriminator used in the state-of-the-art [12]. This is one of the reasons why our cGAN outperformed other state-of-the-art networks on this testing dataset. Based on these results, we can conclude that having a cGAN with prior information can significantly improve the results for the task at hand. In this study, we have also shown the importance of adversarial training. The success of well-trained CNN architectures is affected when the architecture is deployed on test data coming from different centers, different vendors, or changing acquisition parameters. For US data, even when the machine is from the same vendor, the image acquisition settings can be adjusted from one scanning procedure to the next; the BMI of the patient and the orientation of the transducer with respect to the imaged anatomy will also change the appearance of the collected data drastically. We have shown that GANs are more robust to these conditions. We have also investigated how to combine information from bone shadow and B-mode US data by analyzing different fusion strategies. Our results demonstrate that, for the task of bone segmentation, fusing B-mode US and bone shadow features at a later stage outperforms early and mid fusion, specifically for the dataset obtained from the Clarius C3 US probe. One of the advantages of the proposed work is that bone shadow features are obtained almost instantaneously, making the computational time suitable for real-time applications. Our future work will involve (1) extensive clinical validation of the proposed GAN-based method on data obtained from subjects with bone pathology, such as fractures, or bone deformity, such as scoliosis, and (2) extending our network architecture to process volumetric US data [9].