Introduction

The hippocampus is an important subcortical structure whose function is associated with learning and memory (den Heijer et al. 2012). Volumetric analysis of the hippocampus based on magnetic resonance imaging (MRI) has been widely adopted in studies of neurological diseases such as epilepsy (Akhondi-Asl et al. 2011) and Alzheimer’s disease (Wolz et al. 2014). However, manual segmentation of the hippocampus from MRI brain images is time consuming (Carmichael et al. 2005) and suffers from high intra-operator and inter-operator variability (Chupin et al. 2007). Therefore, automatic and reliable segmentation of the hippocampus from MR brain images has been an active research topic in medical image analysis.

In the last decade, multi-atlas based image segmentation (MAIS) methods have been developed and widely adopted in studies of hippocampus segmentation (Warfield et al. 2004, Heckemann et al. 2006, Artaechevarria et al. 2009, Dill et al. 2015, Iglesias and Sabuncu 2015). A typical MAIS method consists of three steps: atlas image selection, atlas image registration, and segmentation label fusion. In the atlas image selection step, a subset of atlas images is selected for a given target image based on a pre-defined measure of anatomical similarity, usually computed from image intensities, e.g., sum of squared differences, correlation, or mutual information (Aljabar et al. 2009, Xie and Ruan 2014, Yan et al. 2015). In the atlas image registration step, the spatial correspondence between each atlas image and the target image is determined, and the atlas images and their corresponding label maps are aligned to the target image (Lötjönen et al. 2010, Doshi et al. 2016). Finally, in the segmentation label fusion step, the warped label maps are fused to obtain a consensus label map for the target image (Warfield et al. 2004, Artaechevarria et al. 2009, Coupé et al. 2011, Hao et al. 2014).

Although a variety of atlas image selection strategies and image registration techniques can be adopted in an MAIS method, existing MAIS methods are typically characterized by their label fusion strategies. Among them, weighted voting label fusion methods have attracted considerable attention. Assuming that the image registration from atlas images to the target image is reliable, traditional weighted voting label fusion strategies combine the corresponding labels based on predefined weighting models (Rohlfing et al. 2004, Heckemann et al. 2006, Artaechevarria et al. 2009, Sabuncu et al. 2010). The simplest method is majority voting, which assigns the same weight to all atlases (Rohlfing et al. 2004, Heckemann et al. 2006). Better segmentation performance can be obtained with more sophisticated voting strategies, such as local weighted voting with an inverse similarity metric (Artaechevarria et al. 2009) and local weighted voting with a Gauss similarity metric (Sabuncu et al. 2010). It has been shown that local weighted voting strategies outperform global methods in segmenting high-contrast structures, whereas global techniques are less sensitive to noise when the contrast between neighboring structures is low (Artaechevarria et al. 2009). Some of the weighted voting label fusion methods can be seen as special cases of a probabilistic generative model (Sabuncu et al. 2010).

Due to inter-subject anatomical variability, the registered atlas images are not always perfectly aligned with the target image. Registration errors may hamper label fusion methods that rely on local image similarity measures under the assumption of a voxel-to-voxel correspondence between the atlas images and the target image. This problem can be effectively mitigated by nonlocal patch based weighted voting methods (Coupé et al. 2011, Rousseau et al. 2011). In these methods, for each warped atlas image, all voxels in a search region are selected and patches centered at these voxels are extracted as atlas image patches. Voting weights are then computed according to the intensity similarities between the atlas image patches and the target image patch.

Many approaches have been proposed to obtain weighting coefficients that improve the segmentation accuracy and robustness of nonlocal patch based weighted voting, for example reconstruction based methods (Liao et al. 2013, Wu et al. 2014) and the joint label fusion (JLF) method (Wang et al. 2013). Reconstruction based methods compute the reconstruction coefficients of the target patch from a patch library by sparse representation (Liao et al. 2013) or local independent projection (Wu et al. 2014), and then use them to combine atlas labels to label the target voxel. Since different atlases may produce similar label errors (Wang et al. 2013), the JLF method minimizes the total expectation of labeling error by explicitly modeling the pair-wise dependency between atlases as the joint probability that two atlases make similar segmentation errors.

Existing MAIS methods typically measure the similarity of image patches with the Euclidean distance metric. However, the Euclidean distance is not necessarily optimal for label fusion since it does not characterize the statistical distributions of image intensities in the patches. These distributions could be estimated from the atlas images and their associated segmentation labels, but they may vary across image locations. It has been reported that patches with similar intensity values may have different segmentation labels, which leads to segmentation errors in MAIS methods (Bai et al. 2015). To overcome this problem, we present a kernel classification method for metric learning such that image patches of the same structure stay close to each other while those of different structures are pushed apart. With the learned metric, we develop an optimal nonlocal weighted voting label fusion method. We validated the proposed method for segmenting the hippocampus from MRI brain images and compared it with state-of-the-art MAIS techniques, including the majority voting method (MV) (Rohlfing et al. 2004, Heckemann et al. 2006), local weighted voting with an inverse similarity metric (LW-INV) (Artaechevarria et al. 2009), local weighted voting with a Gauss similarity metric (LW-GU) (Sabuncu et al. 2010), nonlocal patch based weighted voting with a Gauss similarity metric (NLW-GU) (Coupé et al. 2011, Rousseau et al. 2011), local label learning (LLL) (Hao et al. 2014), and the JLF method (Wang et al. 2013). The experimental results demonstrate that our method achieves better segmentation performance than these state-of-the-art MAIS methods.

Materials and Methods

Image Dataset

The proposed algorithm was validated for segmenting the hippocampus based on the first release of the EADC-ADNI dataset, consisting of MRI scans and corresponding hippocampus labels of 100 subjects (www.hippocampal-protocol.net). These images were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, RRID:SCR_003007) database (adni.loni.usc.edu/), and the subjects come from three diagnostic groups: normal controls (NC), subjects with mild cognitive impairment (MCI), and patients with Alzheimer’s disease (AD).

The Principal Investigator of the ADNI is Michael W. Weiner, MD, VA Medical Center and University of California-San Francisco. ADNI is the result of efforts of many co-investigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the U.S. and Canada. The initial goal of ADNI was to recruit 800 adults, ages 55 to 90, to participate in the research, approximately 200 cognitively normal older individuals to be followed for 3 years, 400 people with MCI to be followed for 3 years, and 200 people with early AD to be followed for 2 years. For up-to-date information, see www.adni-info.org.

Each MRI brain image was manually labeled according to a harmonized protocol (Boccardi et al. 2015). All images were processed with a standard preprocessing pipeline, including alignment along the line passing through the anterior and posterior commissures of the brain (the AC-PC line) and bias field correction, and were then warped into the MNI152 template space using linear image registration with an affine transformation. We randomly selected 40 subjects as the training set and the remaining 60 subjects as the testing set. Clinical scores and demographic information of these subjects are summarized in Table 1.

Table 1 Demographic data and clinical scores of the subjects

Metric Learning for Multi-Atlas Based Image Segmentation

Given a target image I and N atlases \( \tilde{A}_i=\left(\tilde{I}_i,\tilde{L}_i\right),\ i=1,2,\dots,N \), where \( \tilde{I}_i \) is the i-th atlas image and \( \tilde{L}_i \) is its segmentation label map with value 1 indicating foreground and 0 indicating background, a multi-atlas segmentation method registers each atlas image \( \tilde{I}_i \) to the target image and propagates the corresponding segmentation \( \tilde{L}_i \) to the target space, resulting in N warped atlases \( A_i=\left(I_i,L_i\right),\ i=1,2,\dots,N \). Then, it infers the label of each voxel of the target image from the warped atlases. Figure 1 shows a flowchart for segmenting an image with a typical multi-atlas image segmentation method.

Fig. 1
figure 1

Flowchart for segmenting a target image with a multi-atlas based image segmentation method

Identification of a Bounding Box of Hippocampus

Since all images were aligned to the MNI152 template using linear image registration with an affine transformation and resampled to a voxel size of 1 × 1 × 1 mm³, a bounding box can be identified for each of the left and right hippocampi that covers the hippocampus of any unseen target image. In particular, we scan all the atlases to find the minimum and maximum x, y, and z positions of the hippocampus, and add a margin of 7 voxels in each direction to cover the hippocampus of unseen testing images.
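As a concrete illustration, the following sketch shows how such a bounding box could be derived from the atlas label maps; it assumes the label maps are loaded as 3D NumPy arrays in MNI152 space, and the function and variable names are ours rather than part of the original implementation.

```python
# Illustrative sketch (not the authors' code): bounding box over all atlas
# hippocampus labels, with a 7-voxel margin in every direction.
import numpy as np

def hippocampus_bounding_box(label_maps, margin=7):
    """Return (min_corner, max_corner) covering the structure in all atlases."""
    mins, maxs = [], []
    for lab in label_maps:
        coords = np.argwhere(lab > 0)          # voxels labeled as hippocampus
        mins.append(coords.min(axis=0))
        maxs.append(coords.max(axis=0))
    lo = np.min(np.stack(mins), axis=0) - margin
    hi = np.max(np.stack(maxs), axis=0) + margin
    # Clip to the image grid so the margin does not leave the volume
    shape = np.array(label_maps[0].shape)
    lo = np.clip(lo, 0, shape - 1)
    hi = np.clip(hi, 0, shape - 1)
    return lo, hi
```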

Atlas Selection and Image Registration

For each target image, we select the top 20 most similar atlases based on the normalized mutual information (NMI) between the target image and the atlas images within the bounding box (Hao et al. 2014). After the atlas selection, we register each atlas image to the target image using a nonlinear, cross-correlation-driven image registration algorithm, namely ANTs (Avants et al. 2008), with the following command: ANTS 3 -m CC[target.nii,source.nii,1,2] -i 100x100x10 -o output.nii -t SyN[0.25] -r Gauss[3,0]. The nonlinear registration was applied to the image blocks within the bounding box.
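For illustration, a minimal sketch of the NMI-based atlas selection step is given below. The histogram-based NMI estimate and the bin count are common choices and are our assumptions, not details taken from the paper; the registration step itself is performed with the ANTs command above and is not re-implemented here.

```python
# Illustrative sketch: rank atlases by NMI within the bounding box and keep the top 20.
import numpy as np

def normalized_mutual_information(x, y, n_bins=64):
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=n_bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    hxy = -np.sum(pxy[nz] * np.log(pxy[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return (hx + hy) / hxy                     # one common NMI definition

def select_atlases(target_block, atlas_blocks, n_select=20):
    scores = [normalized_mutual_information(target_block, a) for a in atlas_blocks]
    return np.argsort(scores)[::-1][:n_select]  # indices of the most similar atlases
```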

Initial Segmentation with Majority Voting

To reduce the computational cost, we adopt the majority voting based label fusion to obtain an initial segmentation result of the target image. For each voxel, the output of the majority voting label fusion is a probability value of the voxel belonging to the hippocampus. The segmentation result of voxels with 100 % certainty (probability value of 1 or 0) can be directly taken as the final segmentation result (Hao et al. 2014). Then, our method is applied to voxels with probability values greater than 0 and smaller than 1.
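A minimal sketch of this initialization step, assuming the warped atlas label maps are available as binary NumPy arrays in the target space, could look as follows (names are ours):

```python
# Illustrative sketch: majority-voting probability map and the mask of "uncertain"
# voxels that are passed on to the proposed label fusion method.
import numpy as np

def initial_majority_vote(warped_labels):
    """warped_labels: list of binary 3D arrays (one per selected atlas) in target space."""
    prob = np.mean(np.stack(warped_labels, axis=0), axis=0)  # fraction of atlases voting "hippocampus"
    certain_fg = prob == 1.0                  # unanimously hippocampus: kept as final result
    certain_bg = prob == 0.0                  # unanimously background: kept as final result
    uncertain = (prob > 0.0) & (prob < 1.0)   # handled by the proposed method
    return prob, certain_fg, certain_bg, uncertain
```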

Training Patch Library Construction

To label a voxel of the target image, a set of voxel-wise training samples is identified from the warped atlases. Since the registered atlas images are not always perfectly aligned with the target image, we adopt the nonlocal patch based label fusion framework to construct a training library of image patches (Coupé et al. 2011, Rousseau et al. 2011). For labeling a target voxel, voxels in a cube-shaped search neighborhood V of size \( (2r_s+1)\times(2r_s+1)\times(2r_s+1) \) are selected in each warped atlas image, and patches centered at these voxels are extracted and vectorized to form a patch library \( P=[p_1,p_2,\dots,p_n] \), where \( n=N\cdot(2r_s+1)^3 \) is the number of selected patches. The segmentation label of each image patch’s center voxel is used as the image patch’s label \( l_i,\ i=1,2,\dots,n \). Thus, we construct a training dataset \( \Delta=\{(p_i,l_i)\mid i=1,2,\dots,n\} \), where \( p_i \) is the i-th image patch in the patch library P and \( l_i \) is the label of its center voxel.
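The patch library construction for a single target voxel could be sketched as follows; the array and function names are ours, and boundary handling at the edge of the bounding box is omitted for brevity.

```python
# Illustrative sketch: build the patch library (P, labels) for one target voxel.
import numpy as np

def extract_patch(img, center, r_p):
    x, y, z = center
    return img[x - r_p:x + r_p + 1,
               y - r_p:y + r_p + 1,
               z - r_p:z + r_p + 1].ravel()

def build_patch_library(warped_images, warped_labels, center, r_s=1, r_p=1):
    patches, labels = [], []
    offsets = range(-r_s, r_s + 1)
    for img, lab in zip(warped_images, warped_labels):
        for dx in offsets:
            for dy in offsets:
                for dz in offsets:
                    c = (center[0] + dx, center[1] + dy, center[2] + dz)
                    patches.append(extract_patch(img, c, r_p))
                    labels.append(lab[c])       # label of the patch's center voxel
    return np.array(patches), np.array(labels)  # n = N * (2*r_s + 1)**3 samples
```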

Metric Learning

Learning a distance metric from training samples is an important machine learning topic. Many methods have been proposed to learn distance/similarity metrics (Xing et al. 2002). Among them, learning a Mahalanobis distance metric for k-nearest neighbor classification has been successfully applied to many computer vision problems (Guillaumin et al. 2009). In this study, we adopt a supervised metric learning method to learn a Mahalanobis distance metric from the training dataset of image patches (Wang et al. 2015).

Given any two samples \( (p_i,l_i) \) and \( (p_j,l_j) \) from the training dataset Δ, we obtain a doublet \( (p_i,p_j) \) with a label h, where \( h=-1 \) if \( l_i=l_j \) and \( h=1 \) otherwise. For each training sample \( p_i \), we find its \( m_1 \) nearest similar neighbors, denoted by \( \left\{p_{i,1}^s,\dots,p_{i,m_1}^s\right\} \), and its \( m_2 \) nearest dissimilar neighbors, denoted by \( \left\{p_{i,1}^d,\dots,p_{i,m_2}^d\right\} \), and construct \( (m_1+m_2) \) doublets:

$$ \left\{\left({p}_i,{p}_{i,1}^s\right),\dots, \left({p}_i,{p}_{i,{m}_1}^s\right),\left({p}_i,{p}_{i,1}^d\right),\dots, \left({p}_i,{p}_{i,{m}_2}^d\right)\right\} $$
(1)

By collecting all possible doublets, we build a doublet set, denoted by \( \left\{\boldsymbol{z}_1,\dots,\boldsymbol{z}_{N_d}\right\} \), where \( \boldsymbol{z}_j=\left(p_{j,1},p_{j,2}\right),\ j=1,2,\dots,N_d \), and the label of \( \boldsymbol{z}_j \) is denoted by \( h_j \). Given the doublet set \( \left\{\boldsymbol{z}_1,\dots,\boldsymbol{z}_{N_d}\right\} \), we use a kernel method to learn a classifier

$$ g\left(\boldsymbol{z}\right)=sgn\left({\displaystyle \sum_j}{h}_j{\alpha}_jK\left({\boldsymbol{z}}_j,\boldsymbol{z}\right)+b\right) $$
(2)

where \( \boldsymbol{z}_j \) is the j-th doublet, \( h_j \) is its label, \( \boldsymbol{z}=\left(p_{k_1},p_{k_2}\right) \) is a testing doublet, and \( K(\cdot,\cdot) \) is a degree-2 polynomial kernel defined as

$$ K\left({\boldsymbol{z}}_i,{\boldsymbol{z}}_j\right)={\left[{\left({p}_{i,1}-{p}_{i,2}\right)}^T\left({p}_{j,1}-{p}_{j,2}\right)\right]}^2 $$
(3)

Then, we have

$$ {\sum}_j{h}_j{\alpha}_jK\left({\boldsymbol{z}}_j,\boldsymbol{z}\right)+b={\left({p}_{k_1}-{p}_{k_2}\right)}^TM\left({p}_{k_1}-{p}_{k_2}\right)+b, $$
(4)

where \( M=\sum_j h_j\alpha_j\left(p_{j,1}-p_{j,2}\right)\left(p_{j,1}-p_{j,2}\right)^T \) is the matrix to be learned in the Mahalanobis distance metric. Once M is obtained, the kernel decision function g(z) can be used to determine whether \( p_{k_1} \) and \( p_{k_2} \) are similar or dissimilar to each other.

To learn M in the Mahalanobis metric, we adopt a support vector machine (SVM) model:

$$ \min_{M,b,\xi}\ \frac{1}{2}\left\Vert M\right\Vert_F^2+C\sum_j \xi_j \quad \mathrm{s.t.}\quad h_j\left(\left(p_{j,1}-p_{j,2}\right)^T M \left(p_{j,1}-p_{j,2}\right)+b\right)\ge 1-\xi_j,\ \xi_j\ge 0,\ \forall j, $$
(5)

where ‖∙‖ F is the Frobenius norm. The Lagrange dual problem of the above doublet-SVM model is

$$ \max_{\alpha}\ -\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j h_i h_j K\left(\boldsymbol{z}_i,\boldsymbol{z}_j\right)+\sum_i \alpha_i \quad \mathrm{s.t.}\quad 0\le \alpha_l\le C,\ \forall l,\quad \sum_l \alpha_l h_l=0 $$
(6)

The optimization problem can be solved using standard SVM solvers. In the current study, we implemented the metric learning method based on LibSVM (Chang and Lin 2011) and the metric learning code of (Wang et al. 2015).
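As an illustration, the doublet-SVM optimization of Eqs. (2)-(6) can be reproduced with a generic SVM solver and a precomputed kernel matrix. The sketch below is our approximation rather than the authors' LibSVM-based code: it uses scikit-learn, and it constructs the doublets with Euclidean nearest neighbors, which is an assumption consistent with the parameter setting m1 = m2 = 1 reported later.

```python
# Illustrative sketch: learn M with a doublet-SVM and the degree-2 polynomial kernel.
import numpy as np
from sklearn.svm import SVC

def build_doublets(P, labels, m1=1, m2=1):
    """Return difference vectors d_j = p_{j,1} - p_{j,2} and doublet labels h_j."""
    diffs, h = [], []
    for i, p in enumerate(P):
        dist = np.linalg.norm(P - p, axis=1)
        dist[i] = np.inf
        same = np.where(labels == labels[i])[0]
        other = np.where(labels != labels[i])[0]
        for j in same[np.argsort(dist[same])][:m1]:
            diffs.append(p - P[j]); h.append(-1)      # similar pair -> h = -1
        for j in other[np.argsort(dist[other])][:m2]:
            diffs.append(p - P[j]); h.append(+1)      # dissimilar pair -> h = +1
    return np.array(diffs), np.array(h)

def learn_metric(P, labels, C=1.0):
    D, h = build_doublets(P, labels)
    K = (D @ D.T) ** 2                                # degree-2 polynomial kernel, Eq. (3)
    svm = SVC(C=C, kernel="precomputed").fit(K, h)
    # dual_coef_ stores h_j * alpha_j for the support vectors
    M = np.zeros((P.shape[1], P.shape[1]))
    for coef, idx in zip(svm.dual_coef_[0], svm.support_):
        M += coef * np.outer(D[idx], D[idx])          # M = sum_j h_j alpha_j d_j d_j^T
    return M
```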

To ensure that M is positive semi-definite, we compute a singular value decomposition \( M=U\Lambda V^T \) and preserve only the positive singular values in Λ to form a diagonal matrix \( \Lambda_+ \). Then, we let \( M_+=U\Lambda_+V^T \).
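A minimal sketch of this projection is given below; since M is symmetric, the projection can equivalently be implemented by an eigendecomposition of the symmetrized matrix and clipping the negative part of the spectrum, which is what this sketch does.

```python
# Illustrative sketch: project the learned matrix onto the positive semi-definite cone.
import numpy as np

def project_psd(M):
    M_sym = (M + M.T) / 2                       # enforce symmetry first
    eigval, eigvec = np.linalg.eigh(M_sym)
    eigval = np.clip(eigval, 0.0, None)         # keep only the non-negative part of the spectrum
    return eigvec @ np.diag(eigval) @ eigvec.T  # M_plus
```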

Label Fusion with the Learned Metric

With the learned Mahalanobis distance metric M, we obtain a new metric space by introducing the norm \( \left\Vert x\right\Vert_M=\sqrt{x^TMx} \), and the distance between two samples is defined as \( d(x,y)=\left\Vert x-y\right\Vert_M \).

Given a target image patch \( p_x \) and training image patches \( p_i,\ i=1,2,\dots,n \), we compute their distances as

$$ {d}_i=d\left({p}_x,{p}_i\right)=\sqrt{{\left({p}_x-{p}_i\right)}^TM\left({p}_x-{p}_i\right)},\kern0.5em i=1,2,\dots, n $$
(7)

Based on these distances, we select the k nearest training samples \( \left\{\left(p_{s_j},l_{s_j}\right)\mid j=1,2,\dots,k\right\} \) to form a nearest-neighbor set \( \mathcal{N}_k\left(p_x\right) \), and set their similarity weights to one and all others to zero:

$$ w\left({p}_x,{p}_i\right)=\left\{\begin{array}{l}\ 1,\kern0.75em {p}_i\in {\mathcal{N}}_k\left({p}_x\right)\ \\ {}0,\kern0.75em {p}_i\notin {\mathcal{N}}_k\left({p}_x\right)\end{array}\right. $$

Then, we compute the target voxel’s soft label as \( \widehat{L}(x)=\frac{\sum_{i=1}^n w\left(p_x,p_i\right) l_i}{\sum_{i=1}^n w\left(p_x,p_i\right)} \). Finally, \( \widehat{L}(x) \) is thresholded to obtain a binary segmentation label: \( L(x)=1 \) if \( \widehat{L}(x)>0.5 \), and \( L(x)=0 \) otherwise.
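Putting Eq. (7) and the k-nearest-neighbor weighting together, labeling a single target voxel could be sketched as follows (names are ours; k = 9 is the value selected later by cross-validation):

```python
# Illustrative sketch: label one target voxel with the learned Mahalanobis metric.
import numpy as np

def label_with_learned_metric(p_x, P, labels, M, k=9):
    diffs = P - p_x                                     # (n, d) patch differences
    d2 = np.einsum("ij,jk,ik->i", diffs, M, diffs)      # squared Mahalanobis distances, Eq. (7)
    nearest = np.argsort(d2)[:k]                        # k nearest training patches
    l_hat = labels[nearest].mean()                      # soft label estimate
    return 1 if l_hat > 0.5 else 0                      # thresholded binary label
```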

In weighted voting label fusion, two estimation strategies are available: single-point and multi-point. In the single-point strategy, the label estimated from each image patch is applied only to its center voxel. In the multi-point strategy, the label estimated from each patch is applied to all voxels covered by the patch (Rousseau et al. 2011, Wang et al. 2013, Sanroma et al. 2015). Since each voxel then receives multiple label estimates from the image patches that cover it, majority voting of these estimates is used to compute the final segmentation label, as sketched below.
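A possible implementation of the multi-point strategy, assuming the per-patch label estimates have already been computed, is the following; the array names and the helper signature are ours.

```python
# Illustrative sketch: multi-point fusion by per-voxel majority voting.
import numpy as np

def multi_point_fusion(patch_estimates, target_shape, r_p=1):
    """patch_estimates: iterable of ((x, y, z), estimated_label) pairs."""
    votes_fg = np.zeros(target_shape)    # votes for "hippocampus"
    votes_all = np.zeros(target_shape)   # total number of votes per voxel
    for (x, y, z), patch_label in patch_estimates:
        sl = np.s_[x - r_p:x + r_p + 1, y - r_p:y + r_p + 1, z - r_p:z + r_p + 1]
        votes_fg[sl] += patch_label      # the patch label is applied to every covered voxel
        votes_all[sl] += 1
    return (votes_fg / np.maximum(votes_all, 1)) > 0.5   # per-voxel majority vote
```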

Experiments

We optimized the parameters of our method based on the training dataset, and then evaluated the segmentation performance based on the testing dataset. We adopted 9 segmentation evaluation measures to evaluate the image segmentation results (Jafari-Khouzani et al. 2011). By denoting A as the manual segmentation, B as the automated segmentation, and V(X) as the volume of segmentation result X, these evaluation measures are defined as:

$$ \mathrm{Dice}=2\frac{\mathrm{V}\left(\mathrm{A}{\displaystyle \cap}\mathrm{B}\right)}{\mathrm{V}\left(\mathrm{A}\right)+\mathrm{V}\left(\mathrm{B}\right)},\mathrm{Jaccard}=\frac{\mathrm{V}\left(\mathrm{A}{\displaystyle \cap}\mathrm{B}\right)}{\mathrm{V}\left(\mathrm{A}{\displaystyle \cup}\mathrm{B}\right)} $$
$$ \mathrm{Precision}=\frac{\mathrm{V}\left(\mathrm{A}{\displaystyle \cap}\mathrm{B}\right)}{\mathrm{V}\left(\mathrm{B}\right)},\mathrm{Recall}=\frac{\mathrm{V}\left(\mathrm{A}{\displaystyle \cap}\mathrm{B}\right)}{\mathrm{V}\left(\mathrm{A}\right)} $$
$$ MD={\mathrm{mean}}_{\mathrm{e}\in \partial \mathrm{A}}\left({ \min}_{\mathrm{f}\in \partial \mathrm{B}}\mathrm{d}\left(\mathrm{e},\mathrm{f}\right)\right) $$
$$ HD= \max \left(\mathrm{H}\left(\mathrm{A},\mathrm{B}\right),\mathrm{H}\left(\mathrm{B},\mathrm{A}\right)\right),\kern0.2em \mathrm{where}\ \mathrm{H}\left(\mathrm{A},\mathrm{B}\right)={ \max}_{\mathrm{e}\in \partial \mathrm{A}}\left({ \min}_{\mathrm{f}\in \partial \mathrm{B}}\mathrm{d}\left(\mathrm{e},\mathrm{f}\right)\right) $$

HD95: similar to HD, except that the 5% of data points with the largest distances are removed before the calculation,

$$ \mathrm{A}\mathrm{SSD}=\left({\mathrm{mean}}_{\mathrm{e}\in \partial \mathrm{A}}\left({ \min}_{\mathrm{f}\in \partial \mathrm{B}}\mathrm{d}\left(\mathrm{e},\mathrm{f}\right)\right)+{\mathrm{mean}}_{\mathrm{e}\in \partial \mathrm{B}}\left({ \min}_{\mathrm{f}\in \partial \mathrm{A}}\mathrm{d}\left(\mathrm{e},\mathrm{f}\right)\right)\right)/2 $$
$$ \mathrm{RMSD}=\sqrt{\frac{{\mathrm{D}}_{\mathrm{A}}^2+{\mathrm{D}}_{\mathrm{B}}^2}{\mathrm{card}\left\{\partial \mathrm{A}\right\}+\mathrm{card}\left\{\partial \mathrm{B}\right\}}},\ \mathrm{where}\ {\mathrm{D}}_{\mathrm{A}}^2={\displaystyle \sum_{\mathrm{e}\in \partial \mathrm{A}}}{\left({ \min}_{\mathrm{f}\in \partial \mathrm{B}}\mathrm{d}\left(\mathrm{e},\mathrm{f}\right)\right)}^2 $$

In the above definitions, ∂A denotes the set of boundary voxels of A, d(∙, ∙) is the Euclidean distance between two points, and card{∙} is the cardinality of a set.
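For reference, the overlap-based measures can be computed directly from the binary volumes; a minimal sketch is given below, with the surface-distance measures (MD, HD, HD95, ASSD, RMSD) omitted since they additionally require extracting boundary voxels.

```python
# Illustrative sketch: overlap-based evaluation measures for binary volumes
# A (manual segmentation) and B (automated segmentation).
import numpy as np

def overlap_measures(A, B):
    A, B = A.astype(bool), B.astype(bool)
    inter = np.logical_and(A, B).sum()
    union = np.logical_or(A, B).sum()
    return {
        "Dice": 2.0 * inter / (A.sum() + B.sum()),
        "Jaccard": inter / union,
        "Precision": inter / B.sum(),
        "Recall": inter / A.sum(),
    }
```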

Optimization of Parameters

The proposed method has the following parameters: the patch radius \( r_p \), the searching radius \( r_s \), the regularization parameter C in the SVM, the numbers of nearest similar and dissimilar neighbors \( m_1 \) and \( m_2 \) for constructing doublets, and the number of nearest neighbors k for selecting the most similar samples for label fusion. According to (Wang et al. 2015), we set C = 1 and \( m_1=m_2=1 \). We also fixed the searching radius \( r_s=1 \) (i.e., a searching neighborhood of 3 × 3 × 3), since a nonlinear image registration algorithm was used to warp the atlas images to the target image.

The other two parameters, \( r_p \) and k, were determined empirically from {1, 2, 3} and {3, 9, 27}, respectively, based on the training set with 40 leave-one-out cross-validation experiments. Figure 2 shows the average segmentation accuracy, measured by the Dice index, across the 40 leave-one-out cross-validation experiments for different parameter combinations, indicating that the best segmentation performance was obtained with \( r_p=1 \) and k = 9.
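The grid search itself is a straightforward leave-one-out loop; a sketch is given below, where segment_and_score is a hypothetical helper (not part of the paper) that runs the full segmentation pipeline for one left-out training subject and returns its Dice index.

```python
# Illustrative sketch: leave-one-out grid search over r_p and k on 40 training subjects.
import itertools
import numpy as np

best = None
for r_p, k in itertools.product([1, 2, 3], [3, 9, 27]):
    dices = [segment_and_score(left_out=i, r_p=r_p, k=k)   # hypothetical helper
             for i in range(40)]                            # 40 leave-one-out folds
    mean_dice = np.mean(dices)
    if best is None or mean_dice > best[0]:
        best = (mean_dice, r_p, k)
print("optimal parameters:", best)   # the paper reports r_p = 1, k = 9
```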

Fig. 2
figure 2

Average segmentation accuracy measured by Dice index for segmentation results obtained in 40 leave-one-out cross-validation experiments with different combinations of parameters r p and k

Comparison with Existing MAIS Methods

The proposed method, referred to as nonlocal patch based weighted voting with metric learning (NLW-ML) hereafter, was compared with 6 state-of-the-art MAIS methods, including MV (Rohlfing et al. 2004, Heckemann et al. 2006), LW-INV (Artaechevarria et al. 2009), LW-GU (Sabuncu et al. 2010), NLW-GU (Coupé et al. 2011, Rousseau et al. 2011), LLL (Hao et al. 2014), and JLF (Wang et al. 2013).

The parameters of all these methods were optimized on the same training dataset with the same parameter selection strategy. For LW-GU, the patch radius \( r_p \) and \( \sigma_x \) in the Gauss similarity metric need to be determined. With cross-validation, the optimal value of \( r_p \) was 2, selected from {1, 2, 3}, and \( \sigma_x \) was adaptively set as \( \sigma_x=\min_{x_i}\left\{\left\Vert P(x)-P\left(x_i\right)\right\Vert_2+\varepsilon\right\},\ i=1,\dots,N \), where ε is a small constant (1e-20) to ensure numerical stability. LW-INV has two parameters, namely the patch radius \( r_p \) and γ in the inverse function model. The optimal values were \( r_p=2 \) and \( \gamma=-3 \), selected from {1, 2, 3} and {−0.5, −1, −2, −3}, respectively. NLW-GU has three parameters, namely the searching radius \( r_s \), the patch radius \( r_p \), and \( \sigma_x \) in the Gauss similarity metric model. As for NLW-ML, the searching radius \( r_s \) was set to 1. Based on the same cross-validation strategy, the optimal value of \( r_p \) was 1, selected from {1, 2, 3}, and \( \sigma_x \) was adaptively set as \( \sigma_x=\min_{x_{s,j}}\left\{\left\Vert P(x)-P\left(x_{s,j}\right)\right\Vert_2+\varepsilon\right\},\ s=1,\dots,N,\ j\in V \), where ε is again a small constant (1e-20).

The only difference between NLW-GU and LW-GU was the image patches they used: NLW-GU used nonlocal image patches, i.e., a searching radius \( r_s>0 \) was used to extract image patches, whereas LW-GU used local image patches, i.e., \( r_s=0 \). Since both NLW-GU and the proposed NLW-ML use nonlocal image patches, the only difference between them is the distance metric for measuring similarity between image patches. In the experiments, we found that the multi-point estimation strategy outperformed the single-point strategy for all of these label fusion methods; thus, we only report the results obtained with the multi-point strategy.

Similar to NLW-ML and NLW-GU, the searching radius \( r_s \) for LLL and JLF was set to 1. Other parameters of these two methods were optimized on the same training set with the same parameter optimization strategy as adopted by the proposed method. For the LLL method, the optimal patch radius \( r_p \) and the optimal number of training samples K were \( r_p=3 \) and K = 300, selected from {1, 2, 3} and {300, 400, 500}, respectively. Sparse linear SVM classifiers with the default parameter (C = 1) were built to fuse labels in the LLL method, and the single-point label fusion strategy was used. For the JLF method, the optimal patch radius \( r_p \) and the optimal parameter β in the pairwise joint label difference term were \( r_p=1 \) and β = 1, selected from {1, 2, 3} and {0.5, 1, 1.5, 2}, respectively.

Table 2 summarizes segmentation results of the testing images obtained by the segmentation methods under comparison, including MV, LW-INV, LW-GU, NLW-GU, LLL, JLF, and NLW-ML. For each segmentation evaluation measure, the best value is shown in bold. These results indicated that the proposed method achieved the best overall performance. Specifically, Wilcoxon signed rank tests indicated that the proposed method performed significantly better than MV, LW-INV, LW-GU, NLW-GU, LLL (p < 0.001) and JLF (p < 0.05) in terms of Dice and Jaccard index values of their segmentation results. The results also demonstrated that NLW-GU performed better than LW-GU, indicating that the non-local patch based methods had better performance than traditional methods that adopted only corresponding image patches for label fusion (Coupé et al. 2011, Rousseau et al. 2011).

Table 2 Segmentation results of different label fusion methods (mean ± std)

Figure 3 shows box plots of Dice and Jaccard index values of segmentation results obtained by different methods, indicating that our proposed method performed consistently better than other label fusion methods. The superior performance of our method was also confirmed by the visualization results, as shown in Fig. 4.

Fig. 3
figure 3

Comparison of different methods for segmenting left hippocampus (denoted by red boxes) and right hippocampus (denoted by green boxes) with respect to Dice index and Jaccard index. In each box, the central mark is the median and edges are the 25th and 75th percentiles

Fig. 4
figure 4

Hippocampal segmentation results obtained by different methods for one subject randomly chosen from the dataset. The first row shows the segmentation results produced by different methods, the second row shows their corresponding surface renderings, and the third row shows the difference between the manual and automated segmentation results (red: manual segmentation, green: automated segmentation, blue: overlap between manual and automated segmentation)

Discussion

The proposed method is a voting based label fusion method (Liao et al. 2013, Wu et al. 2014, Tong et al. 2015, Wu et al. 2015) with an integrated learning component (Hao et al. 2012b, Hao et al. 2014, Wang et al. 2014, Bai et al. 2015, Zhu et al. 2015). Voting based label fusion methods compute voting weights by comparing the target image patch with each atlas image patch, and use the weights to combine atlas labels. In contrast, machine learning based methods utilize machine learning techniques to build a mapping between the segmentation label and the image appearance. The voting based methods typically assume that image patches with similar intensity information have the same segmentation label. Although this assumption is valid in most cases, a recent study has shown that similar image patches can bear different labels (Bai et al. 2015). Machine learning based methods overcome this limitation by learning a mapping function between the image patch and the label. The proposed method combines the advantages of both families by first adopting a kernel classification method to learn the relationship between image patches and segmentation labels, and then fusing the labels based on weights obtained with the learned distance metric.

Metric learning is essentially a preprocessing step in pattern recognition, aiming to learn from a training dataset a distance metric with which data samples can be classified more effectively (Weinberger and Saul 2009). In this study, we empirically demonstrated that metric learning in conjunction with a k-NN classifier could yield better performance for segmenting the hippocampus from MRI scans than state-of-the-art MAIS methods, including the LLL and JLF methods. We postulate that this promising performance is due to the ability of the k-NN classifier to capture nonlinear relationships that model the image patches of background and hippocampus better than the linear models built by the other methods, such as the sparse linear SVM adopted in the LLL method. In fact, many metric learning methods have been demonstrated to achieve state-of-the-art performance on pattern recognition problems (Weinberger and Saul 2009).

In our method, we used nonlinear image registration to register image blocks of the hippocampus. Our results demonstrated that a small patch size was sufficient to capture inter-subject anatomical differences. Since the metric learning can adaptively learn a distance metric for image patches from the training data, our method is not as sensitive to the patch size as traditional patch based methods.

The computational burden of image registration is a major issue in multi-atlas segmentation methods. To avoid the high computational cost of non-rigid image registration, non-local patch-based image labeling strategies were proposed so that linear image registration could be used to align the image to be segmented with the atlas images (Coupé et al. 2011). However, a non-local image patch searching procedure has to be adopted to identify similar image patches in the label fusion step, which often incurs a higher computational cost than using non-rigid image registration for the atlas alignment (Rousseau et al. 2011). More recently, an optimized patch match strategy was proposed to improve the segmentation (Giraud et al. 2016). In the current study, we adopted an atlas selection strategy to reduce the computational cost associated with the nonlinear image registration (Aljabar et al. 2009, Hao et al. 2014). In particular, the most informative atlases were selected before the nonlinear image registration. Following (Hao et al. 2014), we selected 20 atlas images for segmenting each target image. The computational complexity of our label fusion method is similar to that of classification based methods (Hao et al. 2014, Bai et al. 2015). With a MATLAB based implementation of our algorithm, it took approximately 20 minutes to fuse labels for segmenting one side of the hippocampus on a personal computer with a 4-core 3.4 GHz CPU.

It is straightforward to extend the metric learning method to multi-class classification problems, since the metric learning maximizes the margin between intra-class and inter-class sample differences. However, for most brain region segmentation problems with multiple regions to be segmented, we could also formulate the multi-class classification problem as multiple one-against-the-rest binary classification problems. Such a setting might better handle unbalanced training samples, since we build local classifiers for different voxels of the brain instead of a single global classifier for all brain voxels.

Our future work will integrate the supervised metric learning method and more sophisticated weighted voting label fusion methods, such as joint label fusion (Wang et al. 2013), in which label error is measured by the distance of patches with a predefined distance metric. Furthermore, our method can also be adopted in the shape constrained segmentation framework (Hao et al. 2012a). We will also combine our method with functional MRI image based hippocampus parcellation (Cheng and Fan 2014).

Conclusion

In this paper, we propose a novel nonlocal patch based weighted voting label fusion method with a learned distance metric for measuring similarity between image patches. The validation experiments have demonstrated that the proposed method achieves better segmentation performance than state-of-the-art MAIS methods, indicating that a learned distance metric for measuring the similarity of image patches can improve segmentation performance.

Information Sharing Statement

Software developed in this manuscript is available upon request from Dr. Fan or Dr. Zhu.