
1 Introduction and Related Work

Clinical practice today, especially whole-body CT and MRI scanning, often generates large numbers of high-resolution images. This makes efficient data access, transfer, analysis and visualization challenging, especially in distributed computing environments, where handheld terminals are increasingly used for interactive data access and visualization of anatomy. There is therefore great interest in efficient and robust medical image segmentation algorithms for creating patient-specific anatomical models, for clinical applications, medical research and education, and for visualization of full-body anatomy [2].

While image processing methods have traditionally been single-object or pathology oriented, recent methods [7, 9–11, 14, 16, 17] have made the analysis and segmentation of multiple anatomical structures increasingly possible. However, CT and MR images have intrinsic characteristics that render their automatic segmentation challenging. They are commonly degraded by various noise sources and artifacts due to limited acquisition time and resolution, and by patient motion, all of which reduce the prominence of intensity edges in images. Regardless of the imaging modality and related artifacts, many anatomically and functionally distinct structures, especially those corresponding to soft tissues, have similar intensity levels in images and, furthermore, blend into surrounding tissues whose intensities are close to their own. It is impossible to identify and segment such structures automatically on the basis of intensity information alone. Hence, most advanced segmentation methods exploit some form of prior information on structure location [14, 18] or interrelations [7, 10, 16, 17] to achieve greater robustness and precision.

Graph Cut methods, which have been widely applied to single-object segmentation problems [3], rely on a maximum-flow binary optimization scheme for a discrete cost function on the image graph. For a particular class of cost functions that frequently arises in segmentation applications [12], these methods produce provably-good approximate solutions in multiobject segmentation [4] and global optima in single-object segmentation. In addition, simultaneous multiobject segmentation approaches are superior to their sequential counterparts in that they raise no questions about the best segmentation sequence to follow or about how to avoid propagating errors across individual segmentations [7].

We propose a generic method for automatic multiple-organ segmentation based on multilabel Graph Cut optimization which uses location and intensity likelihoods of organs and prior information of their spatial configuration. The spatial prior is derived from shortest-path pairwise constraints defined on the adjacency graph of structures [10], and the organ location likelihood is defined by probabilistic atlases learned from the Visceral Benchmark training dataset [8]. We register organ atlases to the image prior to segmentation using a fast (2+1)D registration method based on SURF keypoints [1]. Registered atlases are also used to derive organ intensity likelihoods. Prior and likelihood models are then introduced in a joint centroidal Voronoi image clustering and Graph Cut multiobject segmentation framework. We present the results of qualitative and quantitative evaluation of our method on contrast-enhanced CT images from the Visceral Benchmark dataset.

2 Methods

2.1 SURF Keypoint-Based Image Registration

We first outline our fast (2+1)D rigid registration method, which is based on keypoint detection. Features are extracted in 2D volume slices, which has the advantage of being fast and easily parallelizable. Another advantage is that medical data is usually stored in a Picture Archiving and Communication System (PACS) as volume slices rather than full volumes, so our method fits easily into such medical environments. Note that while feature extraction is done in 2D images, registration is still performed in 3D, hence the (2+1)D notation.

We currently use the SURF image descriptor [1], but our approach is generic and would work with others. To reduce computation time, we first downsample the input volume to a user-specified dimension. As a rule of thumb, we isotropically resample each volume so that its second longest dimension is equal to the desired resolution \(R\). For example, when \(R = 80\), the POPI CT volume [19] of dimensions 482 \(\times \) 360 \(\times \) 141 and spacing 0.97 mm \(\times \) 0.97 mm \(\times \) 2 mm is resampled to a 107 \(\times \) 80 \(\times \) 64 volume with an isotropic spacing of 4.39 mm.
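The resampling rule above can be sketched as follows. This is an illustrative toy implementation, not the code used in our experiments, and its rounding conventions are our own, so the resulting figures may differ slightly from those quoted above:

```python
def isotropic_resample_params(dims, spacing, target):
    """Choose an isotropic spacing so that the second-longest physical
    extent of the volume maps to `target` voxels, and derive the new dims.

    dims    -- voxel counts per axis, e.g. (482, 360, 141)
    spacing -- voxel spacing per axis in mm, e.g. (0.97, 0.97, 2.0)
    target  -- desired resolution R for the second-longest dimension
    """
    extents = [d * s for d, s in zip(dims, spacing)]   # physical size per axis
    second = sorted(extents)[-2]                       # second-longest extent
    new_spacing = second / target                      # isotropic spacing in mm
    new_dims = tuple(round(e / new_spacing) for e in extents)
    return new_dims, new_spacing
```

For the POPI example, this reading of the rule yields an isotropic spacing close to the 4.39 mm reported above.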

Fig. 1. Matching two slices of POPI (a) and P1 (b) volumes. (a) and (b) show all features found in both slices (a feature is represented by a circle and its orientation by its radius). (c) and (d) show the three matching features between the two slices.

Next, we extract 2D SURF features from each slice. As these operations are completely independent, this step is carried out in parallel. Figure 1 shows feature extraction on POPI and P1 CT images [15]; the numbers of extracted features are 1154 and 1079, respectively. Figures 1c and d show the three matching pairs found in both images. Once 2D matches are found, we can perform volume registration. For robustness, we use a simple scale \(+\) translation transformation model:

$$\begin{aligned} \left[ \begin{matrix} x' \\ y' \\ z' \end{matrix} \right] = s \,\left[ \begin{matrix} x \\ y \\ z \end{matrix} \right] + \left[ \begin{matrix} t_x \\ t_y \\ t_z \end{matrix} \right] \,. \end{aligned}$$
(1)

We estimate the 4 parameters \(s\), \(t_x\), \(t_y\) and \(t_z\) in the spirit of the RANSAC method [6], an iterative parametric model estimation method known to be robust in the presence of outliers. One RANSAC iteration typically consists of randomly picking a small number of samples to estimate the model parameters, and then counting the number of data samples consistent with the model, rejecting outliers. After all iterations, the model with the highest number of consistent data samples is kept as the solution.
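A RANSAC-style estimator for model (1) can be sketched as follows. Two point matches suffice to fix the 4 parameters: the scale follows from the ratio of inter-point distances, the translation from one match. This is a minimal sketch, not the implementation used in our experiments, and the iteration count, inlier threshold and function names are illustrative choices of ours:

```python
import math
import random

def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def ransac_scale_translation(matches, iters=200, thresh=2.0, seed=0):
    """Estimate (s, t) in  p' = s*p + t  from 3D point matches.

    matches -- list of (p, p') pairs of 3D points
    Returns (s, t, inlier_count) for the best-scoring hypothesis.
    """
    rng = random.Random(seed)
    best = (1.0, (0.0, 0.0, 0.0), -1)
    for _ in range(iters):
        (p1, q1), (p2, q2) = rng.sample(matches, 2)
        d = _dist(p1, p2)
        if d < 1e-9:                       # degenerate sample, skip
            continue
        s = _dist(q1, q2) / d              # scale from distance ratio
        t = tuple(b - s * a for a, b in zip(p1, q1))
        inliers = sum(
            1 for p, q in matches
            if _dist(tuple(s * a + b for a, b in zip(p, t)), q) < thresh)
        if inliers > best[2]:              # keep the best-supported model
            best = (s, t, inliers)
    return best
```

Note that the two-match hypothesis works only because the model has no rotation; a similarity transform with rotation would require more matches per sample.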

2.2 Organ Atlas Construction

Using 20 contrast-enhanced CT images and ground-truth annotations thereof from the Visceral Benchmark dataset [8], we construct a probabilistic atlas for each of the following 20 structures: thyroid; trachea; sternum; liver; spleen; pancreas; gallbladder; first lumbar vertebra; aorta; urinary bladder; right and left lungs, kidneys, adrenal glands, psoas major and rectus abdominis muscle bodies. In addition, we create atlases for three additional image and body regions: background (BKG), thorax and abdomen (THAB) and a body envelope (ENV) from annotations generated automatically as follows. BKG is created by thresholding the CT image followed by morphological processing in order to isolate the body region. THAB is created as the dilated union of the aforementioned 20 structures and their bounding 3D ellipse, from which the structures are subtracted after dilation. Finally, ENV is defined as the image minus BKG and THAB. Note that ENV is a crude body envelope that comprises skin, fat, muscle and bone structures. Refer to Fig. 4c for an illustration.

To create atlases, we use a representative image from the dataset as a reference and register remaining images to it via the method described in Sect. 2.1. We register each structure separately in a bounding cube of a given margin in the intensity image, defined according to the corresponding annotation image, and apply the obtained transform subsequently to the annotation image. We accumulate annotations thus registered in a 3D histogram of reference image dimensions which is normalized to produce the corresponding probability map.
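The accumulation step can be sketched as follows on nested-list volumes (a toy illustration of the histogram-and-normalize idea, assuming the annotations have already been registered to the reference dimensions):

```python
def build_atlas(annotations):
    """Average registered binary annotation volumes into a probability map.

    annotations -- list of 3D nested lists of 0/1 values, all with the
                   same (reference-image) dimensions.
    Returns a 3D nested list where each voxel holds the fraction of
    annotations that mark it as foreground.
    """
    n = len(annotations)
    zdim = len(annotations[0])
    ydim = len(annotations[0][0])
    xdim = len(annotations[0][0][0])
    atlas = [[[0.0] * xdim for _ in range(ydim)] for _ in range(zdim)]
    for ann in annotations:
        for z in range(zdim):
            for y in range(ydim):
                for x in range(xdim):
                    atlas[z][y][x] += ann[z][y][x] / n
    return atlas
```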

Fig. 2. An image-adaptive CVT clustering and its dual graph for a circle image.

2.3 Image Clustering

The full-resolution voxel representation is often redundant because objects usually comprise many similar pixels that could be grouped. Therefore, we simplify the image prior to segmentation by an image-adaptive centroidal Voronoi tessellation (CVT) which strikes a good balance between cluster compactness and object boundary adherence, and helps to place subsequent segmentation boundaries precisely. We have shown that the clustering step improves the overall runtime and memory footprint of the segmentation process up to an order of magnitude without compromising the quality of the result [10].

Let us define a grayscale image as a set of voxels \(\mathcal {I} = \{v \, | \, v=(x,y,z)\}\) and associate to each voxel \(v\in \mathcal {I}\) a gray-level \(I_{v}\) from some range \(I\subset \mathbb {R}\). Given a grayscale image \(\mathcal {I}\) and \(n\) sites \(c_i \in \mathcal {I}\), a CVT partitions \(\mathcal {I}\) into \(n\) disjoint clusters \(C_i\) associated with each centroid \(c_i\) and minimizes the following energy:

$$\begin{aligned} F(v;c_i)=\sum _{i=1}^n\left( \sum _{v\in C_i}\rho (v)\left( \Vert v - c_i\Vert ^2 + \alpha \Vert I_v - I_i\Vert ^2\right) \right) \,. \end{aligned}$$
(2)

In (2), \(\rho (v)\) is a density function defined according to the intensity-gradient magnitude at voxel \(v\), \(\rho (v)=|\nabla I_v|\), \(\alpha \) is a positive scalar and \(I_i\) is the gray-level of the cluster \(C_i\) defined as the mean intensity of its voxels. Refer to Fig. 2 for an illustration in 2D. To minimize (2), we apply a variant of the clustering algorithm in [5] which approximates a CVT in a computationally-efficient manner, involving only local queries on voxels located on boundaries of pairs of clusters.
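For concreteness, energy (2) can be evaluated on a toy 2D image as follows. This sketch sets the density \(\rho\) to 1 (uniform) rather than the gradient magnitude used in our method, and the cluster gray-level \(I_i\) is computed as the mean intensity of the cluster's pixels, as in the text:

```python
def cvt_energy(image, clusters, alpha=1.0):
    """Evaluate the CVT energy on a toy 2D image.

    image    -- dict mapping pixel coordinates (x, y) to gray levels
    clusters -- list of (centroid, pixel_set) pairs
    alpha    -- weight of the intensity term
    Density rho(v) is taken as 1 here for simplicity.
    """
    total = 0.0
    for centroid, pixels in clusters:
        mean_I = sum(image[v] for v in pixels) / len(pixels)  # cluster gray level
        for v in pixels:
            pos_term = sum((a - b) ** 2 for a, b in zip(v, centroid))
            total += pos_term + alpha * (image[v] - mean_I) ** 2
    return total
```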

For reference in later sections, we now define the graph of a CVT, illustrated in Fig. 2b. Denote the surface of a cluster \(C_i\) by \(\partial C_i\). Given a CVT clustering \(\mathcal {C}\), let the set \(\mathcal {S}\) index its clusters, and let \(\mathcal {G = \langle S,E\rangle }\) be an undirected graph on cluster centroids where pairs of clusters sharing a common surface of nonzero area define the set of edges \(\mathcal {E} = \bigl \{ \{i,j\} \mid i, j \in \mathcal {S}, \;|\partial C_i \cap \partial C_j| \ne 0\bigr \}\). The neighborhood of a node \(i \in \mathcal {S}\) is then \(\mathcal {N}_i = \bigl \{j \mid j\in \mathcal {S},\,\{i,j\} \in \mathcal {E}\bigr \}\).
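Extracting the edge set \(\mathcal {E}\) from a clustered image can be sketched as follows on a 2D cluster-label array (our actual implementation works on 3D volumes; this toy uses a 4-neighbourhood for brevity):

```python
def cvt_graph_edges(labels):
    """Edge set of the CVT dual graph from a 2D array of cluster labels.

    Two clusters are connected iff they share at least one pair of
    adjacent pixels (i.e. a boundary of nonzero length in 2D).
    """
    edges = set()
    h, w = len(labels), len(labels[0])
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):        # right and down neighbours
                ny, nx = y + dy, x + dx
                if ny < h and nx < w and labels[y][x] != labels[ny][nx]:
                    edges.add(frozenset((labels[y][x], labels[ny][nx])))
    return edges
```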

2.4 Multi-Organ Image Segmentation

We formulate image segmentation as a labeling problem, defined as the assignment of a label from a set of 23 labels \(L\) representing the structures to be segmented to each of the variables in a set of \(n\) variables, indexed by \(\mathcal {S}\), corresponding to the clusters of a CVT-clustered image. Assume that each variable \(i\in \mathcal {S}\) is associated with the corresponding node in the graph \(\mathcal {G}\) of the CVT defined in Sect. 2.3. An assignment of labels to all variables is called a configuration, and is denoted by \(\ell \in \mathcal {L}\). An assignment of a label to a single variable is denoted by \(\ell _i\). We cast the labeling problem in a maximum a posteriori estimation framework and solve it by minimizing the following energy function of label configurations via the Expansion Moves multilabel Graph Cut algorithm [4]:

$$\begin{aligned} E(\ell ) = {t_1\sum _{i\in \mathcal {S}}D_i(\ell _i)}+{t_2\sum _{i\in \mathcal {S}}P_i(\ell _i)}+{\frac{1}{2}\sum _{i\in \mathcal {S}}\sum _{j\in \mathcal {N}_i}V_{i,j}(\ell _i,\ell _j)}\,. \end{aligned}$$
(3)

In (3), \(t_1\) and \(t_2\) are temperature hyperparameters, \(\mathcal {N}_i\) is the neighborhood of the variable \(i\in \mathcal {S}\). The first and second sums in (3) correspond respectively to organ intensity and location (atlas) likelihood energies, and the third is the energy of a prior distribution of label configurations expressed as a Markov random field w.r.t. the graph \(\mathcal {G}\). We shall define these terms in detail.
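Evaluating energy (3) for a given configuration can be sketched as follows. The dictionaries and the pairwise callback are illustrative stand-ins for the unary tables \(D_i\), \(P_i\) and the pairwise term \(V_{i,j}\); the actual minimization is performed by Expansion Moves, not shown here:

```python
def segmentation_energy(labels, D, P, V, neigh, t1=1.0, t2=1.0):
    """Evaluate energy (3) for a label configuration on the CVT graph.

    labels -- dict node -> assigned label (the configuration)
    D, P   -- dicts (node, label) -> intensity / location unary cost
    V      -- callable (i, j, a, b) -> pairwise cost
    neigh  -- dict node -> list of neighbouring nodes
    The 1/2 factor compensates for counting each edge twice.
    """
    e = t1 * sum(D[(i, labels[i])] for i in labels)
    e += t2 * sum(P[(i, labels[i])] for i in labels)
    e += 0.5 * sum(V(i, j, labels[i], labels[j])
                   for i in labels for j in neigh[i])
    return e
```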

Spatial Configuration Prior. Pairwise terms of (3) encode prior information on interactions between labels assigned to pairs of neighboring variables encouraging the spatial consistency of labeling with respect to a reference model. We define these terms according to our piecewise-constant vicinity prior model proposed in [10], which, unlike the standard Potts model, incurs multiple levels of penalization capturing the spatial configuration of structures in multiobject segmentation. It is defined as follows. Let \(\mathcal {R}\) be the set of symmetric adjacency relations on pairs of distinct labels, \(\mathcal {R}=\{\mathrm {r}\mid a\,\mathrm {r}\,b,\, a,b\in L,\, a\ne b\}\). \(\mathcal {R}\) can be represented by a weighted undirected graph on \(L\), \(\mathcal {A} = \langle L,W\rangle \), with the set of edges \(W = \bigl \{\{a,b\}\mid \exists \mathrm {r}\in \mathcal {R},\,a\,\mathrm {r}\,b,\,a \ne b\bigr \}\) where edge weights are defined by \(w\left( \{a,b\}\right) = 1\), such that \(w\left( \{a,b\}\right) = \infty \) if \(\not \exists \mathrm {r}\in \mathcal {R},\,a\,\mathrm {r}\,b\).

Given the graph \(\mathcal {A}\), we define the pairwise term in (3) as:

$$\begin{aligned} V_{i,j}\bigl (\ell _i,\ell _j\bigr ) = |\partial C_i \cap \partial C_j|\,\varpi \bigl (a,b\bigr ),\quad \ell _i = a,\, \ell _j = b\,, \end{aligned}$$
(4)

where \( \varpi \bigl (a,b\bigr )\) is the shortest-path weight from \(a\) to \(b\) in \(\mathcal {A}\). The area of the common surface of adjacent clusters \(|\partial C_i \cap \partial C_j|\) is introduced so that the sum of pairwise energies in (3) \(\forall a,b\in L\) is equal to the area of the common surface between the corresponding pair of objects multiplied by the shortest-path weight.
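Since all finite edge weights in \(\mathcal {A}\) equal 1, the shortest-path weights \(\varpi(a,b)\) can be computed by breadth-first search from each label. A sketch, with hypothetical label names standing in for the actual label set:

```python
from collections import deque

def shortest_path_weights(adj, source):
    """BFS shortest-path weights from `source` in a unit-weight graph.

    adj -- dict label -> iterable of adjacent labels
    Labels absent from the result are unreachable (weight infinity).
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```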

Fig. 3. (a) An illustration of our hierarchical registration procedure, and (b) an example of registered organ atlases overlaid on the image.

Intensity and Location Likelihoods. Unary terms of (3) measure the cost of assigning labels to variables. They are defined as negative log-likelihood functions derived from organ observed intensity and location probabilities:

$$\begin{aligned} D_i(\ell _i)&= -\ln \prod _{v\in C_i}\Pr (I_v\mid \ell _i)\,,\end{aligned}$$
(5a)
$$\begin{aligned} P_i(\ell _i)&= -\ln \prod _{v\in C_i}\Pr (X_v\mid \ell _i)\,. \end{aligned}$$
(5b)

In (5b), \(X_v\) denotes the object-space coordinates of the voxel \(v\). The conditional probabilities in (5a) and (5b) are those of voxel intensity and location, respectively, given the structure \(\ell _i\). To estimate the conditional probability distribution \(\Pr (I\mid l)\) for a given label \(l\in L\), we first register the corresponding organ atlas to the image, then estimate the distribution as a Gauss-smoothed and normalized intensity histogram derived from voxels in high-probability regions of the registered atlas. The conditional probability distributions \(\Pr (X\mid l)\) are defined directly from the registered atlases. The next section details our hierarchical registration approach, which maps organ atlases to the image.
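The intensity-likelihood estimation can be sketched as follows. The bin count, the atlas-probability cutoff `p_min` and the smoothing width `sigma` are illustrative parameters of ours, not values from our experiments, and the smoothing uses a small truncated Gaussian kernel:

```python
import math

def intensity_likelihood(intensities, atlas_probs, bins=16, lo=0.0, hi=256.0,
                         p_min=0.8, sigma=1.0):
    """Estimate Pr(I | l) as a Gauss-smoothed, normalized histogram over
    voxels whose registered-atlas probability exceeds p_min.

    intensities, atlas_probs -- parallel per-voxel sequences
    """
    hist = [0.0] * bins
    width = (hi - lo) / bins
    for I, p in zip(intensities, atlas_probs):
        if p >= p_min:                                  # high-probability region only
            b = min(bins - 1, max(0, int((I - lo) / width)))
            hist[b] += 1.0
    # truncated discrete Gaussian smoothing (radius 2)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in (-2, -1, 0, 1, 2)]
    smooth = [sum(hist[i + k] * kernel[k + 2]
                  for k in range(-2, 3) if 0 <= i + k < bins)
              for i in range(bins)]
    total = sum(smooth)
    return [v / total for v in smooth] if total else smooth
```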

Hierarchical Registration of Organ Atlases. We register probabilistic organ atlases, constructed as described in Sect. 2.2, to the image in a 3-step hierarchical fashion starting at the full image, then on an intermediate level corresponding to the THAB region, and finally on individual organs. After performing registration on each level, we apply the obtained transform to the corresponding atlas as well as to those of organs contained in the registered region. As in Sect. 2.2, we register each structure separately in a bounding cube of a given margin in the intensity image, defined according to the corresponding atlas. Figure 3 illustrates the hierarchical registration procedure and gives an example of registered organ atlases overlaid on the image to which they have been registered.

Fig. 4. A segmentation of 12 structures from the Visceral Benchmark dataset image 10000109_1_CTce_ThAb. The coronal sections correspond to (a) the image, (b) its segmentation and (c) the associated ground-truth with additional labels for BKG, ENV and THAB. (d) illustrates the adjacency graph used to define the spatial prior model.

Table 1. Mean Dice figures for select organs calculated on 10 segmentations per organ.

3 Results and Conclusions

We have carried out qualitative evaluation on several contrast-enhanced CT images from the Visceral Benchmark training dataset [8]. An example is given in Fig. 4. The number of CVT clusters is equal to \(3\,\%\) of image voxel count. Temperature parameters \(t_1\) and \(t_2\) are set such that intensity and location likelihood-based unary terms contribute equally to (3). The spatial prior is defined according to the adjacency graph given in Fig. 4d. Table 1 presents the results of quantitative evaluation of our method on contrast-enhanced CT images during the Visceral Anatomy 2 Benchmark. We report results corresponding to the best setting of temperature parameters out of 5 tested settings. In addition, we give mean Dice figures only for organs for which our method successfully produced segmentations on all 10 test images.

We conclude with two remarks. (1) Even though our hierarchical approach to mapping atlases to the image relies on a rigid registration method, unlike many hierarchical methods which use non-rigid deformable registration [13], it localizes segmented structure boundaries quite well, because the location information contributed by roughly registered atlases is complemented by the intensity similarity and spatial consistency criteria. (2) Full-body modeling through the introduction of the annotations BKG, ENV and THAB not only complements location information and allows for hierarchical registration, but also increases the discriminative power of the spatial prior by penalizing inconsistent configurations more heavily.