Introduction

Over the past 25 years, image analysis has made significant advancements in the fields of radiology and biomedical engineering. Three-dimensional anatomic models oftentimes rely on CT and MR datasets for their geometric definitions. Segmentation techniques are used to distinguish structures of interest from the remainder of the image. A major class of image segmentation techniques incorporates thresholding. Threshold techniques, which make decisions based on local pixel information, are effective when there is a bimodal distribution of the histogram between the region of interest and background. The readily distinguishable intensities of bone make threshold techniques a common exercise for orthopedic applications. Because spatial information is neglected, however, indistinct region boundaries can prove problematic. While thresholding-based algorithms are easy to employ, significant manual intervention is often required to provide reliable and anatomically correct structural borders. This proves especially true at the bony articulations.

A number of algorithms have been employed to date to automatically segment bony regions of interest from 3D medical images. These have included filtering approaches [1], 3D Markov Random Fields [2], active contour models [3, 4], watershed segmentation [5], fast marching and level sets [6], and atlas-based segmentation [7]. These examples illustrate the breadth of applicable segmentation algorithms. The majority of these algorithms have been applied to relatively large bony regions including the hip and knee.

A recent trend in the field of image segmentation has been toward artificial intelligence-based algorithms [8–26]. Artificial neural networks (ANNs), a form of artificial intelligence inspired by neurophysiology, have shown great potential for segmenting medical images. Unlike other artificial intelligence systems, ANNs require no explicit rule generation or pre-programming. Instead, they are trained using example data with known output. The ANN learns through the training process, and develops implicit rules for analysis. The effect of training can then be measured using a separate set of data where the results are known, but not provided to the ANN.

Artificial neural networks are collections of mathematical models that emulate some of the observed properties of biological nervous systems and draw on the analogies of adaptive learning. The key element of the ANN paradigm is the structure of the information processing system. It is composed of a large number of highly interconnected processing elements that are analogous to neurons and are tied together with weighted connections (fixed or variable) analogous to synapses. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons. This is true of ANNs as well. Learning typically occurs by example through training, or exposure to a truthed set of input/output data where the training algorithm iteratively adjusts the connection weights. These connection weights store the knowledge necessary to solve specific problems. The storage of information across the network weights enables generalizations to be made. Consequently, appropriate classifications are made even for input patterns not actually included in the training set, provided that the training set covered a representative group of patterns. This ability to learn and generalize allows neural networks to solve image-processing problems that are not readily tractable using rule-based conventional classifiers. Neural networks have been applied to the field of image segmentation by including probability information and signal intensity information for a voxel and its local neighborhood [17, 22].

Such automated routines have been used in the identification of bone, chest, and breast lesions [10–14, 18, 21, 24, 26], brain structures [9, 15, 17, 19, 20, 22–25], and cardiovascular regions [8–11, 13, 15–17, 19, 20, 23, 25, 26]. Recent advances have further expanded this technique to segment orbit [27] and abdominal regions [14, 27, 28]. These image segmentation algorithms have played a role in many biomedical applications including the quantification of tissue volumes, diagnosis, localization of pathology, study of anatomical structures, treatment planning, and computer-integrated surgery [8, 28]. However, ANNs have yet to be used specifically for orthopedic applications.

The objective of this study was to develop tools for automating the identification of bony structures, to assess the reliability of this technique against manual raters, and to validate the resulting defined regions of interest against physical surface scans obtained from the same specimen. For this study, an ANN was created and trained to recognize the phalanx bones of the human index finger.

The phalanges were chosen as the structures of interest for several reasons. First and foremost, the ability to collect a large amount of data (i.e., multiple bones) in a single scan, as opposed to a single bone, made the phalanges an ideal choice. Furthermore, starting with the bones of the human hand tested the bounds of the methodology. Each finger (excluding the thumb) consists of three long, slender bones: the proximal, middle, and distal phalanges. These bones are small and in close proximity to one another. This close proximity tested the ability of the ANN to distinguish a bone from its adjacent neighbor(s). Lastly, the geometric similarities of the individual bones made it possible to test bones of similar shape but differing size, which would help account for variability among individuals. Tools capable of handling the complexities of this region should be readily applicable to virtually any other structure or joint of the human body.

Materials and methods

Fifteen arms from 8 donors, amputated at the elbow, were obtained from Anatomy Gifts Registry located in Hanover, MD, USA. The donor set consisted of 13 female and 2 male specimens with a mean age of 73.7 years. Aside from being thawed for the scanning procedures, the specimens were frozen and stored at −20°C.

Each of the arms came with a ten-character identifying code. The ten characters included two letters followed by eight digits. These codes became the basis for the identifying scheme used for the project. The fingers were numbered sequentially beginning with the index finger, labeling the index finger 1, the middle finger 2, the ring finger 3, and the little finger 4. The letter P was used to represent a proximal phalanx bone, M to represent a middle phalanx bone, and D to represent a distal phalanx bone. Finally, either the letter R or the letter L was assigned to the identifying code to indicate a right or left hand.

CT data acquisition

For conformity during the scanning procedure, each specimen was individually affixed to a customized Plexiglas fixture. The neutral hand position was defined by placing the palm of the hand on the flat surface of the construct and aligning the third metacarpal with the long axis of the forearm [29]. A dashed line, drawn longitudinally down the center of the fixture was used to align the specimen. An outline of a human hand was also included on the construct for ease of alignment. Nine holes were drilled in the fixture, one between each finger and one on either side of the thumb and wrist. Locking cable ties were used to strap the fingers and wrist securely to the Plexiglas plate.

Images of each specimen were obtained on a Siemens Sensation 64 CT scanner (matrix = 512 × 512, FOV = 172 mm, KVP = 120, current = 94 mA, exposure = 105 mA) with an in-plane resolution of 0.34 mm and a slice thickness of 0.4 mm. Following image acquisition, the data were processed using BRAINS2 software [30, 31]. The images were resampled to 0.2-mm isotropic voxels and spatially normalized such that the vertical plane of the frame was aligned superiorly/inferiorly in the coronal view, vertically aligning the third metacarpal. The images were cropped to contain only the phalanx and carpal bones for ease of data management.

Manually segmented surface definitions

Two trained technicians (referenced as Tracer1 and Tracer2) manually traced 15 index fingers using BRAINS2 software. The regions of interest (ROIs) defining the distal, middle, and proximal bones were manually traced by each technician. The average time required to manually segment the three bones of the index finger was 58.5 min, ranging from 40 to 83 min. In order to ensure minimal inter-rater variability, a study was conducted to compare the performance of the two tracers by determining the relative overlap (Eq. 1).

$$ \text{Relative Overlap} = \frac{\text{Volume}\left( \text{Tracer1} \cap \text{Tracer2} \right)}{\text{Volume}\left( \text{Tracer1} \cup \text{Tracer2} \right)} $$
(1)

The relative overlap computed between the two raters was 0.89 for all the bones. The individual bones (proximal, middle, and distal) had overlaps of 0.91, 0.90, and 0.87 respectively.
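As an illustration of Eq. 1, the following minimal sketch computes the relative overlap of two binary segmentations. It assumes each mask is represented as a set of (x, y, z) voxel indices on a common grid of equally sized voxels, so that the volume ratio reduces to a ratio of voxel counts. The function name and the toy masks are hypothetical and are not taken from the study data.

```python
def relative_overlap(mask_a, mask_b):
    """Relative overlap (Eq. 1): |A intersect B| / |A union B| for two voxel sets."""
    a, b = set(mask_a), set(mask_b)
    union = a | b
    if not union:
        raise ValueError("both masks are empty; relative overlap is undefined")
    return len(a & b) / len(union)


# Hypothetical toy masks: (x, y, z) indices of voxels labeled as bone.
tracer1 = {(x, y, 0) for x in range(4) for y in range(4)}     # 16 voxels
tracer2 = {(x, y, 0) for x in range(1, 5) for y in range(4)}  # same shape, shifted by one voxel
overlap = relative_overlap(tracer1, tracer2)
print(f"relative overlap = {overlap:.2f}")
```

In this toy case the masks share 12 voxels and together cover 20, giving a relative overlap of 0.60. Identical masks give 1.0 and disjoint masks give 0.0.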

Architecture of the ANN

The ANN algorithm was implemented in four stages:

1. Probability map generation
2. Creation of training vectors
3. Neural network training
4. Application of the neural network [32]

A fully connected, feed-forward, three-layer ANN was used in this study. The architecture for the ANN consisted of 27 input elements (3 probability values, 3 spherical coordinates, and 21 signal intensity values), 81 hidden elements, and 3 output elements. The 21 signal intensity values consist of 9 signal intensity values along the largest gradient including the current voxel under examination, and 12 signal intensity values surrounding the voxel under consideration (±2 voxels along the x, y, and z axes). The three probability values represent the likelihood that each of the bones (proximal, middle, and distal) exist at a given spatial location. The three outputs were used to define each of the three bones under consideration. Standard backpropagation was used for training. Of the 15 manually segmented index fingers, 10 of the fingers were used to generate the probability map and train the ANN. The remaining 5 fingers were used to evaluate the reliability of the ANN.
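The layer sizes described above can be sketched as a forward pass through a 27-81-3 network. This is an illustrative sketch only: the sigmoid activation, the bias terms, the weight initialization range, and the example feature values are assumptions that are not specified in the text, and the function names are hypothetical.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 27, 81, 3  # layer sizes stated in the text


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """Small random weights; each row holds n_in weights followed by one bias (assumed)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_forward(weights, inputs):
    """Fully connected layer: weighted sum plus bias, then sigmoid (assumed activation)."""
    # zip stops at len(inputs), so the trailing bias entry is added separately.
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + row[-1]) for row in weights]


def build_input(probabilities, spherical, intensities):
    """Concatenate 3 probabilities, 3 spherical coordinates, and 21 intensities."""
    vector = list(probabilities) + list(spherical) + list(intensities)
    if len(vector) != N_INPUT:
        raise ValueError(f"expected {N_INPUT} inputs, got {len(vector)}")
    return vector


rng = random.Random(0)
w_hidden = init_layer(N_INPUT, N_HIDDEN, rng)
w_output = init_layer(N_HIDDEN, N_OUTPUT, rng)

# Hypothetical feature values for a single voxel, scaled to [0, 1].
x = build_input([0.9, 0.1, 0.0], [0.5, 0.2, 0.7], [0.6] * 21)
hidden = layer_forward(w_hidden, x)
outputs = layer_forward(w_output, hidden)  # one activation per bone
```

Each of the three output activations corresponds to one bone (proximal, middle, or distal). With untrained random weights, as here, the activations carry no anatomical meaning.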

Probability map generation

One of the inputs to the neural network is the probability of each structure being segmented at a given spherical coordinate location in the atlas space. Ultimately, the ANN does not consider locations with zero probability. The neural network was trained to define all three phalanx bones simultaneously, thus allowing each finger to be completely segmented with a single pass of the neural network. The index finger of specimen MD05010306R was chosen as the atlas space. Since finger size varies across fingers and specimens, the fingers were scaled to the size of the atlas finger. To perform the scaling, a bounding box was placed around each finger and the corners of the extracted image volume were used as the landmark locations to define a thin-plate spline registration [33]. This registration removed global scaling differences between the fingers. To allow both the right and the left hands to contribute to the probability map, the landmarks for the left hands had the right and left corners swapped, resulting in a mirror image. Once the global scaling was removed, all the fingers in the dataset were registered to the atlas image using Thirion’s demons registration [34]. After registration, the manually defined binary segmentations were warped using the resulting deformation field. Finally, the warped masks were used to generate a probability map that was then filtered using a Gaussian filter of 0.2 mm in size. In addition to smoothing the probability map, the Gaussian filter introduced dilation, which allowed unusual bony geometries to be considered by the neural network.
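The final averaging step can be sketched as follows, assuming the manual masks have already been warped into the atlas space and flattened to equal-length lists of 0/1 values. The registration and Gaussian filtering steps are deliberately omitted, and the function name and toy masks are hypothetical. One such map would be built per bone.

```python
def probability_map(warped_masks):
    """Voxelwise fraction of training masks that label each voxel as bone."""
    n = len(warped_masks)
    if n == 0:
        raise ValueError("at least one mask is required")
    size = len(warped_masks[0])
    if any(len(m) != size for m in warped_masks):
        raise ValueError("all masks must be defined on the same atlas grid")
    return [sum(m[i] for m in warped_masks) / n for i in range(size)]


# Hypothetical example: four warped masks over a five-voxel atlas.
masks = [
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1],
]
prob = probability_map(masks)

# Only voxels with non-zero probability are later presented to the network.
candidates = [i for i, p in enumerate(prob) if p > 0.0]
```

Here `prob` is `[0.0, 0.5, 1.0, 0.75, 0.25]`, so the first voxel is excluded from further consideration. Smoothing the map afterwards would spread non-zero probability to neighboring voxels, which corresponds to the dilation effect noted above.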

Neural network training

The second step of the ANN application was the creation of the training vectors. Essentially, this created a file with known inputs (signal intensity, probability information, and spatial location) and outputs based on manually defined regions. These data were utilized in the third stage, the training of the ANN. The first two stages were generally quick and could be accomplished in approximately 1–2 h depending on the number of specimens in the training set, the image size, and the speed of the computer. The third stage, the training of the neural net, was the most time-consuming stage of the algorithm. On average it took 2 days to train the network. When the mean square error reached an asymptote, the training was terminated and the neural network weights were saved. The weights were saved every 25 iterations, allowing the reliability to be studied as a function of training by assessing the ability of the resulting neural network to generalize to the testing set. For this project, approximately one million training vectors were used. The learning rate and momentum for the backpropagation training were 0.3 and 0.15 respectively. An asymptote during training was reached within 250 iterations of all one million training vectors.
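The role of the learning rate and momentum can be illustrated with a single weight-update rule. The sketch below assumes the common formulation of gradient descent with momentum, in which each weight change is the negative gradient scaled by the learning rate plus a fraction of the previous change. The exact update used by the training software is not given in the text, and the quadratic error function is a hypothetical stand-in for the network error.

```python
LEARNING_RATE = 0.3  # value reported in the text
MOMENTUM = 0.15      # value reported in the text


def momentum_step(weights, gradients, previous_deltas,
                  learning_rate=LEARNING_RATE, momentum=MOMENTUM):
    """Return (new_weights, deltas) for one gradient-descent step with momentum."""
    deltas = [-learning_rate * g + momentum * d
              for g, d in zip(gradients, previous_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas


# Toy problem (hypothetical): minimize E(w) = 0.5 * sum(w_i ** 2), so dE/dw_i = w_i.
weights = [1.0, -2.0]
deltas = [0.0, 0.0]
errors = []
for _ in range(50):
    gradients = list(weights)
    weights, deltas = momentum_step(weights, gradients, deltas)
    errors.append(0.5 * sum(w * w for w in weights))
```

On this toy problem the error decreases toward zero and flattens out, which mirrors the stopping criterion described above: training ends once the mean square error no longer improves appreciably.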

Application of the neural network

To segment a new scan, only the final stage was required. This stage is relatively fast and can be accomplished in approximately 5 min. This stage warps the probability information from the atlas space to the current scan. Only voxels within the image that have a non-zero probability were used to generate the input vectors for the ANN. The saved weights of the networks were loaded and applied to the current input vectors. The output is a binary mask for each structure in the network configuration [32]. The binary mask is generated by first filtering the network activation function using a Gaussian filter of 0.05 mm and then thresholding the resulting image at 0.5.
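The final masking step can be sketched as below, assuming the (already smoothed) network activations and the warped probability values are available as flat lists over the same voxels. The Gaussian filtering is omitted, the treatment of activations exactly equal to 0.5 is an assumption, and the function name and values are hypothetical.

```python
def binary_mask(activations, probabilities, threshold=0.5):
    """Label a voxel as bone when its prior is non-zero and its activation reaches the threshold."""
    return [1 if p > 0.0 and a >= threshold else 0
            for a, p in zip(activations, probabilities)]


# Hypothetical values for five voxels of one structure.
activations = [0.9, 0.6, 0.4, 0.8, 0.5]
probabilities = [1.0, 0.5, 0.75, 0.0, 0.25]
mask = binary_mask(activations, probabilities)
```

The fourth voxel is rejected despite its high activation because its probability is zero. In the procedure described above such voxels are never presented to the network at all, so the check here only stands in for that exclusion.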

Reliability and validation of the neural network

Once the ANN was trained on the ten training images, the reliability of the network was evaluated using the five scans designated as the testing set. The resulting binary segmentations were compared with the traces from the manual raters. Relative overlap (Eq. 1) was used to compare the two defined regions of interest. In addition to measuring the reliability of the neural network, the validity of the ANN segmentation was also determined. Once the region of interest had been segmented from the source image data set, a triangulated isosurface of the bone was generated and exported in stereo-lithography (STL) format. This neural network-driven bony surface representation was then compared with the three-dimensional physical surface (laser) scan of the corresponding cadaveric specimen. Four of the reliability index fingers were available for physical surface scanning. These four specimens were prepared and scanned as described in the next section. The weights saved at different points during training were assessed, and the weights corresponding to 250 iterations were found to produce superior results to those generated with less training. The results from the 250 iterations of training are reported here.

Cadaveric preparation and surface scanning procedures

In preparation for the physical scanning procedure, the bones were carefully dissected from each hand. Care was taken not to nick or scratch the bony surface with instruments during dissection. The majority of the surrounding soft tissue was removed during dissection. Any tissue that remained post-dissection was removed following the defleshing process prescribed by Donahue et al. [35]. The bones were placed in a 5.25% sodium hypochlorite (bleach) solution for approximately 4–6 h to remove the remaining tissue [35]. The bones were examined hourly to avoid decalcification and to remove any extraneous loose tissue. Once denuded, the bones were degreased in a soapy water solution followed by a period of air-drying. Due to the natural color and porous texture of bone, a negligibly thin layer of white primer was applied to the surface prior to scanning, thereby improving the scanner’s ability to detect the bony surface. Three-dimensional surface scans of each physical specimen were acquired using a Roland LPX-250 3D laser scanner (0.2-mm resolution).

Surface comparisons

In order to compare the ANN and laser-scanned surface definitions, the axes of the bony surface representations were aligned to account for the differences in the axes of the laser and CT scanners. The axes of the ANN surfaces were oriented to correspond with the axes of the scanned surface. Once reoriented, the ANN surface was co-registered to the laser-scanned surface using a rigid, iterative closest point (ICP) algorithm [36] that was initialized by aligning the centers of mass of the two surfaces. After the surfaces were in correspondence, the distance between the two surfaces was measured. A distance map was created for the surface based on the Euclidean distance metric, or the shortest distance from a source point to a target surface [37]. The source was always the physical surface scan, while the ANN surface representations were considered the target surfaces. This enabled each specimen to act as its own control to establish the validity of the ANN-defined regions of interest.
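The initialization and distance measurement can be sketched as follows. This is a simplified illustration under stated assumptions: both surfaces are represented only by their vertices (so point-to-surface distance is approximated by point-to-nearest-vertex distance), the rotational refinement performed by ICP is omitted, and the function names and coordinates are hypothetical.

```python
import math
import statistics


def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def align_centroids(moving, fixed):
    """Translate `moving` so that its centroid coincides with that of `fixed`."""
    cm, cf = centroid(moving), centroid(fixed)
    shift = tuple(f - m for f, m in zip(cf, cm))
    return [tuple(c + s for c, s in zip(p, shift)) for p in moving]


def closest_point_distances(source, target):
    """Euclidean distance from every source point to its nearest target point."""
    return [min(math.dist(s, t) for t in target) for s in source]


# Hypothetical vertices (mm): the laser scan is the source, the ANN surface the target.
laser_scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
# Toy ANN surface: slightly larger and expressed in a different coordinate frame.
ann_surface = [(1.2 * x + 5.0, 1.2 * y - 2.0, z) for x, y, z in laser_scan]

ann_aligned = align_centroids(ann_surface, laser_scan)
distances = closest_point_distances(laser_scan, ann_aligned)
mean_distance = statistics.mean(distances)
sd_distance = statistics.pstdev(distances)
```

The mean and standard deviation of `distances` correspond to the kind of summary reported per bone in the results. The brute-force nearest-point search shown here is quadratic in the number of vertices, so a spatial index would be preferable for full-resolution surfaces.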

Results

The reliability of the neural network for the proximal, middle, and distal phalanges of the index fingers is summarized in Table 1. The relative overlap between the ANN and a manual tracer was 0.87, 0.82, and 0.76, for the proximal, middle, and distal index phalanx bones respectively. The average relative overlap for an entire index finger was 0.82. Specimen CA05042125L had the greatest overlap in the study, with a result of 0.87 for the entire finger. A visual comparison between the neural network output and the manual segmentation is shown in Fig. 1 for one specimen in the reliability set.

Table 1 Overlap of manual and neural network segmentation
Fig. 1 a, b Two coronal and c, d two sagittal views of the manual (red) and automated (blue) regions of interest for specimen CA05042125L

After the reliability between the neural network and the tracers was determined, the validity of the neural network output was evaluated. The average distances between the laser-scanned surfaces and the ANN-generated surfaces are shown in Table 2. The middle and proximal phalanx bones had the smallest average distances, measuring 0.29 mm and 0.35 mm respectively, while the average distance of the distal phalanx bones was 0.40 mm. The index finger referenced as CA05042125L had the smallest average distance overall at 0.28 mm, while finger MD05042226L had the largest average distance of 0.46 mm.

Table 2 Average distances (standard deviation) between the laser-scanned surfaces and artificial neural network (ANN)-generated surfaces

A distance map between the 3D physical surface scan and the ANN output is shown in Fig. 2. The blue color represents a complete intersection or a crossing of the surfaces, while any discrepancy greater than 1.0 mm is represented in white.

Fig. 2 Representative distance maps between the artificial neural network (ANN) output and the 3D physical surface scans of the a distal, b middle, and c proximal phalanges. Distances are represented in mm

Discussion

This initial evaluation of ANN-based segmentation of bony regions of interest shows great promise. Even in the relatively small phalanx bones of the finger, the neural network provided reliable estimates (average relative overlap of 0.82) of the bony regions. Furthermore, the ANN segmented the structures in less than one-tenth of the time (on average) required for a manual rater to define these structures and only required the user to define a bounding box around the entire finger. One specimen (MD05042226L) demonstrated poor reliability, especially for the distal phalanx. Upon inspection, a bony growth was clearly evident. This abnormality was not present on any of the training images, and hence a portion of this bone exceeded the bounds of the probability map. To reliably apply the network to such pathological regions in the future, additional training images that exhibit this condition would be required.

In addition to providing a reliable and automated estimate of the phalanx bones, the neural network also generated a valid representation of the bone surface. When compared with a three-dimensional surface scan of the same specimen, the ANN-generated surface was, on average, approximately one voxel away from the physical laser scan. The average distances for the proximal, middle, and distal phalanges were 0.35 mm, 0.29 mm, and 0.40 mm respectively. We have previously reported on the validation of manual raters to define these regions compared with the physical laser scan [38]. The average distances for the manual raters were 0.19, 0.20, and 0.21 mm for the proximal, middle, and distal phalanges respectively. Orthopedic imaging provides a unique opportunity to evaluate the validity of automated segmentation algorithms since the bony regions of interest can be extracted from cadaveric specimens and scanned using a 3D surface scanner.

While the initial results are promising, we believe that there is room for further improvement, allowing the neural network to provide the same reliability as manual raters. The neural network parameter space has not yet been fully explored. Several parameters such as the smoothing of the probability map and the network configuration (number of hidden nodes) could be tuned to further improve these results. These issues will be explored to optimize the network architecture for the segmentation of orthopedic regions of interest. The training images for the neural network should also include a wider subject population that incorporates different pathologies. Ultimately, the neural network should be able to discriminate between fine differences in finger geometry regardless of pathology. Expansion of this project will also attempt to include segmentation from different imaging modalities such as MRI. In addition, other machine learning-based segmentation algorithms such as support vector machines (SVM) will be explored. The advantage of SVM algorithms is that only the input features need to be selected. For this study, the ANN architecture was chosen over other machine learning algorithms such as SVM for two reasons. First, we have a previous history of applying the ANN to segment regions of the brain [17, 39]. Second, we have compared the reliability of the ANN- and SVM-based segmentation algorithms for segmentation of brain regions and have found that the algorithms performed similarly, with the ANN being faster for segmentation after the initial training was completed [39]. By using automated methods such as the ANN for segmentation, rater drift and inter-rater variation are eliminated. Automated methods also decrease the amount of time and manual effort required to extract the data of interest. The prohibitive barrier of time would no longer be an issue in 3D model development, allowing patient-specific modeling to become a reality.