1 Introduction

Energy minimization is of strong practical and theoretical importance to computer vision. An energy expresses our criteria for a good solution—low energies are good, high energies are bad—independent of any algorithm. Algorithms are however hugely important in practice. Even for low-level vision problems we are confronted by energies that are computationally hard (often NP-hard) to minimize. As a consequence, a significant portion of computer vision research is dedicated to identifying energies that are useful and yet reasonably tractable. Our work is of precisely this nature.

Computer vision is full of ‘labeling’ problems cast as energy minimization. For example, the data to be labeled could be pixels, interest points, point correspondences, or mesh data such as from a range scanner. Depending on the application, the labels could be either semantic (object classes, types of tissue) or describe geometry/appearance (depth, orientation, shape, texture).

There are many labeling problems for which the labels naturally form groups. In computer vision, a recent trend is the use of ‘context’ to resolve ambiguities in object recognition (e.g. Choi et al. 2010; Ladický et al. 2010a; Zhou et al. 2011). The idea is that certain groups of labels are self-consistent because they tend to appear together, e.g. the {car,road,sky} labels all belong to the “outdoors” context, while {table,chair,wall} all belong to the “indoors” context. In computer graphics, one may wish to automatically classify the faces of a 3D mesh into semantic parts for the benefit of artists and animators (Kalogerakis et al. 2010). The part labels arm, tail, and wheel naturally belong to different groups based on their context (humanoid, quadruped, vehicle). In operations research, facility location can be cast as a labeling problem, and hierarchical variants have been studied (Svitkina and Tardos 2006; Sahin and Süral 2007). All of these disparate labeling problems are similar from an optimization point of view.

When labels are explicitly grouped in a hierarchy, the costs in the energy are naturally structured. In this work, we characterize a class of energies as having hierarchical costs. If an energy satisfies our “h-metric” and “h-subset” conditions, then we can often minimize it much more effectively. We provide a novel hierarchical fusion (h-fusion) algorithm to minimize our class of energies. Our algorithm generalizes the well-known α-expansion algorithm (Boykov et al. 2001) yet we provide better empirical performance and a better approximation bound. The improved theoretical guarantees are important because, in practice, α-expansion can easily get stuck in poor local minima for this useful class of energies; to the best of our knowledge, our h-fusion algorithm is state of the art. Like the original fusion algorithm (Lempitsky et al. 2010) ours is highly parallelizable.

With respect to our energy itself, the most relevant work is the “label costs” of Delong et al. (2012). Our notion of hierarchical costs is a special case of their energy, yet it is important to explicitly characterize this subclass because, as we show, it permits a much better minimization algorithm. With respect to our algorithm, by far the most closely related work is the r-hst metrics of Kumar and Koller (2009). We review both these works in some detail.

2 Review of Related Work

First we review energies with “label costs” as described in Delong et al. (2012). Let the set \(\mathcal{P}\) index the data that needs to be labeled, and let \(\mathcal{L}\) be the set of possible labels. A labeling is any complete assignment \(f = (f_{p})_{p \in \mathcal{P}}\) where variable \(f_{p} \in\mathcal{L}\) designates the label assigned at index p. For example if \(\mathcal{P}= \{p,q\}\) and \(\mathcal{L}=\{\ell_{1},\ell_{2}\}\) then a valid labeling might be \(f=(\ell_{2},\ell_{1})\) where \(f_{p} = \ell_{2}, f_{q} = \ell_{1}\).

We seek joint assignment f that minimizes an energy balancing three types of criteria

$$ E(f) = D(f) + V(f) + H(f). $$
(1)

The D term encodes the individual preference of each variable \(f_{p}\), whereas the V term encourages pairs of variables to take similar values. These are typically expressed as

$$ D(f) = \sum_{p \in \mathcal{P}} D_p(f_p), \qquad V(f) = \sum_{pq \in \mathcal{N}} w_{pq}\, V(f_p, f_q) $$

where the neighborhood set \(\mathcal{N}\) identifies pairs of variables \(f_{p}\) and \(f_{q}\) that interact, and each weight \(w_{pq} \geq 0\) scales the strength of V for that particular pair. Energies with these terms are quite standard in vision, particularly in models based on Markov random fields (MRFs) and when performing maximum a posteriori (MAP-MRF) inference (Li 1994).

The H term encourages the labeling f to use as few unique labels as is necessary. For example, do not explain the data with six labels if five would suffice just as well. The general form that we use in this work can be expressed as

$$ H(f) = \sum_{L \in \mathcal{H}} H(L)\, \delta_L(f) $$

where \(\mathcal{H}\) is any collection of subsets of \(\mathcal{L}\), each H(L) is the label cost of subset \(L \subseteq \mathcal{L}\), and the label cost indicator \(\delta_{L}(\cdot)\) is defined as

$$ \delta_L(f) = \left\{ \begin{array}{l@{\quad}l} 1 & \text{if } \exists p : f_p \in L \\[3pt] 0 & \text{otherwise.} \end{array} \right. $$

Intuitively, H(L) is the shared cost to be paid if f uses any labels drawn from label group L. The cost is shared because it is paid at most once, regardless of how many labels from L appear in f. In many applications there is no reason to group the labels, and so setting \(\mathcal{H}= \{ \{\ell\} \}_{\ell\in\mathcal{L}}\) restricts us to individual per-label costs \(H(\{\ell\})\) or, with slight abuse of notation, setting \(\mathcal{H}= \mathcal{L}\) and writing each per-label cost as \(H(\ell)\). Label costs are an important special case of ‘high-arity’ potentials recently studied in computer vision (e.g. Werner 2008; Woodford et al. 2009).
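To make the three terms concrete, here is a minimal sketch that evaluates E(f)=D(f)+V(f)+H(f) on a toy instance; all names and numeric values are illustrative only.

```python
def energy(f, D, V, w, N, H):
    """f[p] is the label at index p; D[p][l] is the data cost; V(a,b) the
    smoothness cost scaled by w[(p,q)] over neighbor pairs N; H maps a
    frozenset of labels L to its cost, paid once if f uses any label in L."""
    e = sum(D[p][f[p]] for p in range(len(f)))              # D(f)
    e += sum(w[(p, q)] * V(f[p], f[q]) for (p, q) in N)     # V(f)
    used = set(f)
    e += sum(cost for L, cost in H.items() if used & L)     # H(f): delta_L(f)
    return e

# Two variables, two labels, and a single label cost on the group {0, 1}:
D = [[0, 2], [2, 0]]
V = lambda a, b: 0 if a == b else 1
w = {(0, 1): 1}
N = [(0, 1)]
H = {frozenset({0, 1}): 5}
print(energy([0, 1], D, V, w, N, H))   # 0 + 0 + 1 + 5 = 6
```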

From an energy standpoint there are a number of works that combine data costs D with V and/or H. Table 1 lists a small selection of such works in computer vision. As can be seen from the V column, many techniques are only applicable if V(⋅,⋅) satisfies certain assumptions.

Table 1 A selection of relevant energy-based problem formulations in computer vision. All are a special case of energy (1), with the exception of Ladický et al. (2010a) which in principle allows for a wider class of energies that encourage ‘parsimony’

Definition 1

A smoothness cost V is semi-metric if it satisfies

$$ V(\alpha,\beta) = V(\beta,\alpha) \geq 0 \quad\text{and}\quad V(\alpha,\beta) = 0 \;\Leftrightarrow\; \alpha = \beta \quad \text{for all } \alpha,\beta \in \mathcal{L} $$

and V is a metric if it additionally satisfies the triangle inequality

$$ V(\alpha,\beta) \leq V(\alpha,\gamma) + V(\gamma,\beta) \quad \text{for all } \alpha,\beta,\gamma \in \mathcal{L}. $$

These assumptions were originally outlined by Boykov et al. (2001) as sufficient conditions for their α-expansion and αβ-swap algorithms. These conditions arise because of the inherent limitations of graph cut methods (Boykov and Jolly 2001; Kolmogorov and Zabih 2004). Because our algorithmic approach is also based on graph cuts, we shall see a similar kind of limitation arise in Sect. 4.1.

We note that, with the exception of Zhu and Yuille (1996), all the works listed in Table 1 are of a discrete nature where \(\mathcal{P}\) is a finite set. A number of variational formulations of E have recently been developed with continuous analogues of D+V (e.g. Pock et al. 2008, 2009; Olsson et al. 2009) and of D+V+H (Yuan and Boykov 2010). Our main ideas also apply to such continuous formulations, but we focus on the discrete setting.

Energies of the form (1) are NP-hard to minimize in all but a few special cases. Even D+V is NP-hard to minimize for \(|\mathcal{L}|\geq3\) by reduction from 3-Terminal-Cut (Boykov et al. 2001); in fact this case is max-SNP-hard (Cunningham and Tang 1999), meaning there is some ϵ>0 for which no polynomial-time (1+ϵ)-approximation algorithm can exist (i.e. no polynomial-time approximation scheme (PTAS)). The D+H case is NP-hard by straightforward reduction from Set-Cover using only per-label costs \(H(\ell)\). A hardness result for approximating Set-Cover by Feige (1998) implies that D+H cannot be approximated within a ratio of \((1-\epsilon)\ln|\mathcal{P}|\) in polynomial time unless \(\mathrm{NP} \subseteq \mathrm{DTIME}[n^{O(\log\log n)}]\), i.e. NP would have to be only slightly super-polynomial, which is currently deemed unlikely. This observation will help to put the approximation bound for our algorithm into perspective.

Observation 1

Feige’s hardness result is evidence that no polynomial-time algorithm can minimize energy (1) within a constant ratio of the optimum.

From an algorithmic standpoint, the most similar work to ours is a recent paper by Kumar and Koller (2009). Their aim is to efficiently minimize energies of the form D+V. They consider the class of r-hierarchically well-separated tree (r-hst) metrics (Bartal 1998) which are a special case of metrics defined above. We discuss r-hst metrics in Sect. 8, but for now it is enough to understand that they are a special case of metrics and that each has an associated constant r>1. Kumar and Koller provide an algorithm that, for a particular r-hst metric, provides a \(\frac{2r}{r-1}\)-approximation to the globally optimal labeling. Although this coefficient is very large for r≈1, the approximation only depends on V (not on \(|\mathcal{P}|\) or \(|\mathcal{L}|\)). In some cases this ratio is better than the well-known bound for the α-expansion algorithm (Boykov et al. 2001) and the \(O(\lg|\mathcal{L}|\lg\lg|\mathcal{L}|)\) bound for linear programming relaxation (Kleinberg and Tardos 2002).

Kumar and Koller describe their algorithmic process as hierarchical graph cuts. This does not refer to computing a graph cut in a hierarchical manner, but rather to minimizing an energy D+V via a hierarchical sequence of standard graph cuts. They show that the r-hst metric assumption on V is sufficient to apply their algorithm. We aim to minimize energies of the form D+V+H and, motivated by the difficult H term, have independently developed an algorithm we call hierarchical fusion (h-fusion). However, at the highest conceptual level our algorithm is the same as theirs—only our class of energies and our sequence of subproblems differ; we solve each subproblem with the extended α-expansion of Delong et al. (2012). The h-fusion process will be explained in Sect. 5.

Also worth mentioning is a recent work by Felzenszwalb et al. (2010) concerning energies of the form E=D+V. By making the strong assumption that both D and V are tree metrics, they can compute a global optimum. However, most applications do not satisfy the metric assumption on data costs D. Note that our work makes no such assumption. We discuss tree metrics in Sect. 4.1.

Our contributions

This work is about characterizing a class of energies that will be useful in vision, along with a fast and effective algorithm for dealing with large-scale problems. Specifically, for energies of the form D+V+H,

(a) we define h-metric smoothness costs V, a wider class than tree metrics yet still sufficient for our h-fusion algorithm to apply,

(b) we define h-subset label costs H, a sufficient condition to apply h-fusion with high-order label costs,

(c) we prove that the approximation bound of h-fusion is much better than that of α-expansion in important cases, and

(d) we provide worst-case examples to show that our theoretical bound is tight in some reasonable sense.

Contribution (a) is about a more general characterization of V than that used by Kumar and Koller (2009) and by Felzenszwalb et al. (2010), while (b–d) are completely novel results on how to handle label costs in this framework.

The remainder of this paper is organized as follows. Section 3 reviews the α-expansion algorithm in some detail. Section 4 then introduces our notion of hierarchical costs, a useful subclass of energy (1). Section 5 describes our h-fusion algorithm, and Sect. 6 derives its approximation bound. Section 7 gives some experiments to suggest how our energies and algorithm work. Finally, Sect. 8 discusses other applications, relations to facility location, and extensions.

3 Review of α-Expansion

The algorithm we introduce in this paper uses the well-known α-expansion algorithm (Boykov et al. 2001) as a key subroutine. The algorithm was designed for energies of the form D+V, though we employ an extension to D+V+H by Delong et al. (2012). Our approximation bound is therefore intricately linked with α-expansion and its limitations, so we review the algorithm here. Readers familiar with α-expansion may skip ahead to Sect. 4.

3.1 How α-Expansion Works

The α-expansion algorithm performs local search using a powerful class of ‘moves’. Given an initial labeling \(\hat{f}\) and some particular label \(\alpha\in\mathcal{L}\), an α-expansion move gives each variable the following binary choice: either keep the current label \(\hat{f}_{p}\), or switch to label α. Let \(\mathcal{M}^{\alpha}(\hat{f})\) denote the set of all moves (labelings) that can be generated this way, in other words

$$ \mathcal{M}^{\alpha}(\hat{f}) = \bigl\{ f : f_p \in\{\hat{f}_p\} \cup\{\alpha\} \bigr\}. $$
(2)

All variables are simultaneously allowed to keep their current label or to switch, so there are an exponential number of possible moves. For each choice of α we must efficiently find the best possible move. In practice, this sub-problem is solved by casting it as a graph cut (Greig et al. 1989) and using combinatorial algorithms to compute the optimal binary configuration (e.g. Goldberg and Tarjan 1988; Boykov and Kolmogorov 2004; Strandmark and Kahl 2010). Figure 1 illustrates the steps behind a single expansion move. Because a graph cut finds the best move from an exponential number of possibilities, the α-expansion algorithm is a very large-scale neighborhood search (VLSN) technique (Ahuja et al. 2002) and is very competitive in practice (Szeliski et al. 2006).

Fig. 1

In an α-expansion move, each location p is given a binary choice: keep its current label \(\hat{f}_{p}\) or switch to label α. Here we use 1 to indicate a variable that decided to take label α. Because variables interact through V, the binary decisions are inter-dependent and so the optimal move is calculated by a graph cut

With respect to some current labeling \(\hat{f}\), the full set of possible expansion moves is \(\mathcal{M}(\hat{f}) = \bigcup_{\alpha \in \mathcal{L}} \mathcal{M}^{\alpha}(\hat{f})\). The α-expansion algorithm simply performs local search over the full search neighborhood \(\mathcal{M}(\hat{f})\). Perhaps surprisingly, local search with expansion moves will terminate with a labeling \(\hat{f}\) that is within a constant factor of the globally optimal labeling \(f^{*}\) (see Sect. 3.3). The α-expansion algorithm is generally implemented as shown below.

Algorithm 1

α-Expansion(E)—local search with expansion moves
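Since the pseudocode figure for Algorithm 1 is not reproduced above, the following minimal sketch shows the standard structure of the local search; `best_expansion_move` is a hypothetical helper standing in for the graph-cut step (the “line 4” referred to below).

```python
def alpha_expansion(E, labels, f_init, best_expansion_move):
    """Local search with expansion moves (sketch of Algorithm 1)."""
    f = f_init
    improved = True
    while improved:                      # loop until no label yields a better move
        improved = False
        for alpha in labels:             # one 'cycle' expands on each label once
            # argmin over M^alpha(f), computed by a graph cut ("line 4")
            f_new = best_expansion_move(E, f, alpha)
            if E(f_new) < E(f):
                f, improved = f_new, True
    return f
```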

3.2 Graph Cuts and the Limits of α-Expansion

From Table 1 we see that α-expansion is applicable if V is a metric (Definition 1). We briefly review how this limitation arises, as it will be relevant to our new h-fusion algorithm.

The main subproblem for α-expansion (line 4) is to find, for a particular label \(\alpha\in\mathcal{L}\), the best move \(f \in \mathcal{M}^{\alpha}(\hat{f})\). The best move is the one with minimal E(f). Expansion moves are fundamentally binary so we can encode a move as a function f(x) of binary vector \(\mathbf{x}=(x_{p})_{p \in \mathcal{P}}\) where

$$ f(\mathbf{x})_p = \left\{ \begin{array}{l@{\quad}l} \hat{f}_p & \text{if } x_p = 0 \\[3pt] \alpha& \text{if } x_p = 1. \\ \end{array} \right. $$

To solve the subproblem on line 4, we construct a binary energy E′(x)=D′(x)+V′(x) with specially designed data terms \(D'_{p}\) and smoothness terms \(V'_{pq}\):

$$ \begin{array}{l@{\qquad}l} D'_p(0) := D_p\bigl(\hat{f}_p\bigr) & V'_{pq}(0,0) := V\bigl (\hat{f}_p,\hat{f}_q\bigr)\\[3pt] D'_p(1) := D_p(\alpha) & V'_{pq}(0,1) := V\bigl(\hat{f}_p,\alpha \bigr)\\[3pt] & V'_{pq}(1,0) := V\bigl(\alpha,\hat{f}_q\bigr)\\ [3pt] & V'_{pq}(1,1) := V(\alpha,\alpha)\\ \end{array} $$
(3)

where D and V are the terms of the original multi-label energy E(f). Since E′(x)=E(f(x)), minimizing E′ finds an optimal expansion move w.r.t. E.
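For concreteness, construction (3) can be tabulated directly; this is a minimal sketch, not tied to any particular graph-cut library.

```python
def expansion_binary_terms(D, V, f_hat, alpha, N):
    """Tabulate the binary terms of construction (3) for one expansion move.
    D[p][l] is the multi-label data cost, V(a, b) the smoothness cost,
    f_hat the current labeling, N the neighbor pairs."""
    D_prime = {p: (D[p][f_hat[p]], D[p][alpha])        # (D'_p(0), D'_p(1))
               for p in range(len(f_hat))}
    V_prime = {(p, q): [[V(f_hat[p], f_hat[q]), V(f_hat[p], alpha)],
                        [V(alpha, f_hat[q]),    V(alpha, alpha)]]
               for (p, q) in N}
    return D_prime, V_prime
```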

A binary energy of form E′=D′+V′ can be minimized efficiently by a graph cut if E′(x) is a submodular function (Boros and Hammer 2002; Kolmogorov and Zabih 2004). Energy E′ is submodular if and only if it satisfies

$$ V'_{pq}(0,0) + V'_{pq}(1,1) \leq V'_{pq}(0,1) + V'_{pq}(1,0) $$
(4)

for all \(pq \in\mathcal{N}\). Now, for which multi-label energies does the expansion energy (3) result in submodular E′? The submodularity condition (4) holds if and only if

$$ V\bigl(\hat{f}_p,\hat{f}_q\bigr) + V(\alpha,\alpha) \leq V\bigl(\hat{f}_p,\alpha\bigr) + V\bigl(\alpha,\hat{f}_q\bigr) $$
(5)

for all neighbors pq and all possible values of \(\hat{f}_{p}\) and \(\hat{f}_{q}\). Since we must assume \(\hat{f}_{p}\) and \(\hat{f}_{q}\) could take any label in \(\mathcal{L}\) we arrive at the following due to Kolmogorov and Zabih (2004), Boykov and Veksler (2006).

Observation 2

The α-expansion algorithm is applicable to energies of the form D+V if and only if for all \(\alpha,\beta,\gamma\in\mathcal{L}\)

$$ V(\alpha,\alpha) + V(\beta,\gamma) \leq V(\alpha,\gamma) + V(\beta,\alpha) $$
(6)

The metric assumption is sufficient for (6) to hold, and so α-expansion is applicable if V is a metric.
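Condition (6) is easy to check numerically for a given cost function; a minimal sketch:

```python
def expansion_applicable(V, labels):
    """Check condition (6): V(a,a) + V(b,c) <= V(a,c) + V(b,a) for all labels.
    True means alpha-expansion's binary subproblems are submodular."""
    return all(V(a, a) + V(b, c) <= V(a, c) + V(b, a)
               for a in labels for b in labels for c in labels)

potts = lambda a, b: 0 if a == b else 1
assert expansion_applicable(potts, range(5))   # Potts is metric, so (6) holds
```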

Rother et al. (2005) showed that, by assuming a non-arbitrary initial labeling, α-expansion can be applied to a wider class of energies: for each \(\beta,\gamma\in\mathcal{L}\), either V must satisfy (6) for all \(\alpha\in\mathcal{L}\), or V(β,γ)=∞. Unfortunately, α-expansion offers no approximation guarantees for this extended class (Theorem 1 below). In this paper we define a class of smoothness costs V called h-metrics, and it too can be extended to non-metric infinities this way. However, we aim to quantify approximation bounds for our algorithm so, for simplicity, we will not include such infinities in our definition of h-metrics.

Delong et al. (2012) showed that α-expansion is also applicable to energies with label costs as long as H(L)≥0 for each label subset \(L \subset\mathcal{L}\). The expansion step requires a binary energy of the form E′(x)=D′(x)+V′(x)+H′(x) where H′ defines very high-order potentials over x, unlike V′ which defines only ‘quadratic’ (pairwise) potentials. We will use their construction in our main subroutine.

3.3 Approximation Bounds of α-Expansion

Local search with expansion moves is guaranteed to terminate at a local minimum \(\hat{f}\) that is within a constant factor of the global optimum \(f^{*}\) (Veksler 1999; Boykov et al. 2001). The actual bound is \(E(\hat{f}) \leq2c E(f^{*})\) where c≥1 is some constant that depends on V. If c is small, then we can expect α-expansion to do at least a reasonable job. If c is large, the bound is meaningless and we have even more reason to try other algorithms (e.g. Kolmogorov 2006).

Understanding the approximation bound of α-expansion will be helpful for understanding our generalized bound in Sect. 6. The following holds for any energy E=D+V with \(D_{p}(\cdot) \geq 0\) and metric V(⋅,⋅).

Theorem 1

(Veksler 1999)

If \(f^{*}\) is a global minimum of E=D+V, and \(\hat{f}\) is a local minimum w.r.t. expansion moves, then

$$ E(\hat{f}) \leq2 c E\bigl(f^*\bigr) \quad \text{\textit{where} } c = \frac{ \max_{\alpha\neq\beta\in\mathcal{L}} V(\alpha, \beta) }{ \min_{\gamma\neq\zeta\in\mathcal{L}} V(\gamma, \zeta) }. $$
(7)

In other words, α-expansion is a 2c-approximation algorithm where c≥1 depends on the ratio of largest to smallest costs in V. Below are some V(⋅,⋅) terms commonly used in vision, shown in matrix form for \(|\mathcal{L}|=5\).

[Matrices omitted: Potts, truncated-linear, and linear potentials in \(5 \times 5\) matrix form.]

Underneath we see the coefficient c corresponding to each case. The simplest potential (Potts) penalizes \(f_{p} \neq f_{q}\) equally, and gives the best approximation bound. When the range of values is large, e.g. for the ‘linear’ penalty \(|f_{p} - f_{q}|\), the bound (7) gets worse. Our h-fusion algorithm beats the α-expansion bound for a wide class of ‘hierarchical’ costs.
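For instance, with illustrative values of our own choosing (the original matrices are omitted above), the Potts and linear potentials for \(|\mathcal{L}|=5\) give

$$ V_{\mathrm{Potts}}(\alpha,\beta) = \begin{cases} 0 & \text{if } \alpha=\beta \\ 1 & \text{otherwise} \end{cases} \;\Rightarrow\; c = \tfrac{1}{1} = 1, \qquad V_{\mathrm{lin}}(\alpha,\beta) = |\alpha-\beta| \;\Rightarrow\; c = \tfrac{4}{1} = 4. $$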

Delong et al. (2012) showed that incorporating label costs H(⋅) into α-expansion can worsen the above bound. If arbitrary label costs H(L)≥0 are assumed on arbitrary subsets \(L \subset\mathcal{L}\) then the bound is as follows.

Theorem 2

(Delong et al. 2012)

If \(f^{*}\) is a global minimum of E=D+V+H, and \(\hat{f}\) is a local minimum w.r.t. α-expansion, then the following bound is tight:

$$ E(\hat{f}) \leq(2 c + c_{2}) E\bigl( f^{*}\bigr) + \sum_{L \subset\mathcal{L}} H(L) $$
(8)

where \(c_{2}=\max_{L:H(L)>0}(|L|-1)\).

This tells us that, if arbitrary label costs are assumed, standard α-expansion is no longer a constant-ratio approximation algorithm (recall Observation 1) and furthermore the bound gets worse (\(c_{2}\) gets larger) if costs are defined on large subsets L.

4 Energies with Hierarchical Costs

We wish to minimize an energy of the general form (1), but we assume the labels are grouped in some kind of hierarchy. Depending on the application, the grouping will likely be either semantic (hierarchy of object labels) or geometric (families of geometric models). One option for minimizing such energies is to ignore the grouping and simply apply α-expansion. However, Theorem 2 suggests caution, because α-expansion finds poor local minima when, for example, strong high-order label costs are involved.

We express the label grouping structure as follows. The leaves of the tree are the actual labels \(\mathcal{L}\). The root of the tree we denote by r. For trees of non-trivial structure, we call the extra nodes pseudo-labels and denote them by the set \(\mathcal{S}\), so the set of all intermediate nodes is \(\mathcal{S}\cup\{r\}\). The structure of the tree itself can be defined by a child-to-parent map π(⋅) where for each node \(i \in\mathcal{L}\cup\mathcal{S}\) a parent \(\pi(i) \in\mathcal{S}\cup\{r\}\) is defined. We use \(\mathcal{T}= \mathcal{L}\cup\mathcal{S}\cup\{r\}\) to denote all tree nodes. Figure 2 shows a simple label hierarchy.

Fig. 2

A hierarchical grouping of label set \(\mathcal{L}=\{\ell_{1},\ldots,\ell_{6}\}\) into groups \(\mathcal{S}=\{A,B\}\) where, for example, the parent \(\pi(\ell_{4})=B\). At right is a possible labeling f and the pseudo-labeling \((\pi\circ f)\) it induces

Merely declaring the labels to be ‘grouped’ does not in itself change energy E(f) (we still have \(f_{p} \in\mathcal{L}\)) nor is the standard α-expansion algorithm ‘aware’ of a label hierarchy. However, in Sects. 4.1–4.2 we describe energies for which a ‘good’ tree can be defined so that our h-fusion algorithm (Sect. 5) is provably better than α-expansion. In fact if one defines a flat tree (\(\mathcal{S}= \{\,\}\)) then our algorithm and approximation bounds all reduce to those of Boykov et al. (2001) and Delong et al. (2012) as a special case.

Sections 4.1 and 4.2 develop key definitions: the class of h-metrics for smoothness costs V, and the class of h-subsets for label costs H. These definitions are directly motivated by our h-fusion algorithm and how its computation is organized. If an energy satisfies our h-metric and h-subset assumptions, then it can be minimized by h-fusion (Sect. 5).

4.1 Hierarchical Smoothness Costs (h-Potts, h-Metrics)

The following notation will be helpful for discussing a tree defined by parent function π. We use \(\pi^{n}(i)\) to denote n applications of π, as in \(\pi(\cdots\pi(i))\). The set of children of a node j is denoted by

$$ \mathcal{I}(j) = \bigl\{ i \in\mathcal{T}: \pi(i) = j \bigr\}. $$

The set of all nodes in the subtree rooted at j is denoted by

$$ \mathrm{subtree}(j) = \bigl\{ i \in\mathcal{T}: \pi^n(i) = j \text { for some } n \ge0 \bigr\}. $$

The set of labels belonging to the subtree of node i is

$$ \mathcal{L}_i = \bigl\{ \ell\in\mathcal{L}: \ell\in\mathrm {subtree}(i) \bigr\}. $$

A hierarchical grouping of labels \(\mathcal{L}\) induces a grouping of the smoothness cost values inside each V(⋅,⋅) potential. Looking at the tree structure in Fig. 2 we can say that labels \(\ell_{1}\) and \(\ell_{2}\) are both in group A whereas \(\ell_{1}\) and \(\ell_{4}\) are from different groups; thus \(V(\ell_{1},\ell_{2})\) can be interpreted as a “within-group cost” and \(V(\ell_{1},\ell_{4})\) as a “between-group cost.” An example is illustrated below, where regions of the \(|\mathcal{L}| \times |\mathcal{L}|\) matrix are delineated.

[Matrix omitted: the \(|\mathcal{L}| \times |\mathcal{L}|\) cost matrix partitioned into within-group blocks (A with A, B with B) and between-group blocks.]

Individual costs inside each block can vary, though it is helpful to consider the simple case when the cost within each block is constant. So, we now define hierarchical Potts, a natural class of hierarchical smoothness costs V parameterized by node-based ‘transition’ costs \(\{w_{i}\}_{i \in \mathcal{S}\cup\{r\}}\). In what follows, lca(α,β) denotes the lowest common ancestor of nodes α and β with respect to the tree structure π.

Definition 2

(Delong et al. 2011)

Given tree structure π, for each node \(i \in\mathcal{T}\) let \(w_{i} \geq 0\) be its transition cost so that \(V(\alpha,\beta)=w_{\mathrm{lca}(\alpha,\beta)}\) for all \(\alpha,\beta\in\mathcal{L}\) and \(w_{i}=0\) for each leaf \(i \in\mathcal{L}\). We then say that (V,π) forms a hierarchical Potts (h-Potts) potential.

For example, Delong et al. (2011) use a two-level tree where \(w_{r}\) is the transition cost between ‘super-labels’ and each \(w_{i}\) for \(i \in\mathcal{S}\) is the transition cost between ‘sub-labels’ in group i. They show that if \(w_{i} \leq 2w_{r}\) then V is metric and standard α-expansion can be applied. For our h-fusion algorithm to apply, a simple sufficient condition is that \(w_{i} \leq w_{\pi(i)}\) for all \(i \in\mathcal{L}\cup\mathcal{S}\).
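A minimal sketch of an h-Potts potential per Definition 2, assuming a hypothetical child-to-parent map `pi` and per-node transition costs `w`; the tree mirrors Fig. 2.

```python
def lca(a, b, pi):
    """Lowest common ancestor of leaves a, b under child-to-parent map pi."""
    ancestors = set()
    while a in pi:
        ancestors.add(a)
        a = pi[a]
    ancestors.add(a)                 # the root
    while b not in ancestors:
        b = pi[b]
    return b

def h_potts(a, b, pi, w):
    """Definition 2: V(a,b) = w[lca(a,b)], with w zero on leaves."""
    return w[lca(a, b, pi)]

# Fig. 2 tree: leaves l1..l6, groups A = {l1,l2,l3}, B = {l4,l5,l6}, root r.
pi = {'l1': 'A', 'l2': 'A', 'l3': 'A', 'l4': 'B', 'l5': 'B', 'l6': 'B',
      'A': 'r', 'B': 'r'}
w = {'l1': 0, 'l2': 0, 'l3': 0, 'l4': 0, 'l5': 0, 'l6': 0,
     'A': 1, 'B': 1, 'r': 2}
assert h_potts('l1', 'l2', pi, w) == 1   # within group A
assert h_potts('l1', 'l4', pi, w) == 2   # between groups
assert h_potts('l3', 'l3', pi, w) == 0   # same label: leaf cost is 0
```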

Now we define h-metrics, a class of hierarchical smoothness costs where V is not necessarily parameterized by w i . As we shall see in Sect. 5, the h-metric assumption is necessary for our specialized algorithm.

Definition 3

We say that pair (V,π) forms a hierarchical metric (h-metric) if π is irreducible and for every \(i \in \mathcal{L}\cup\mathcal{S}\)

$$ V(\alpha_1,\alpha_2) + V(\beta,\gamma) \leq V(\alpha_1,\gamma) + V(\beta,\alpha_2) \quad \text{for all } \alpha_1,\alpha_2 \in \mathcal{L}_i \text{ and } \beta,\gamma \in \mathcal{L}_{\pi(i)} \setminus \mathcal{L}_i. $$
(9)

Note that for a flat tree (\(\mathcal{S}= \{\,\}\)) each set \(\mathcal{L}_{i}=\{i\}\) so we always have \(\alpha_{1}=\alpha_{2}\). It is easy to show, then, that the h-metric constraint (9) on a flat tree reduces to the standard α-expansion constraint (6). Figure 3 gives a concrete example of an h-metric and shows how (9) constrains the costs that V can encode.
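Inequality (9) can be checked mechanically for a candidate pair (V,π); a minimal sketch, assuming a hypothetical `subtree_leaves` helper that returns \(\mathcal{L}_{i}\):

```python
def is_h_metric(V, non_root_nodes, pi, subtree_leaves):
    """Check (9) for every i in L ∪ S: for all a1, a2 in L_i and
    b, g in L_{pi(i)} minus L_i, require
    V(a1,a2) + V(b,g) <= V(a1,g) + V(b,a2)."""
    for i in non_root_nodes:
        Li = subtree_leaves(i)
        others = subtree_leaves(pi[i]) - Li     # labels under i's siblings
        for a1 in Li:
            for a2 in Li:
                for b in others:
                    for g in others:
                        if V(a1, a2) + V(b, g) > V(a1, g) + V(b, a2):
                            return False
    return True
```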

Fig. 3

The costs in (a) define an h-metric for the tree structure in Fig. 2. Inequality (9) is satisfied for every tuple \((\alpha_{1},\alpha_{2},\beta,\gamma)\) required; the two tuples emphasized at center are \((\ell_{2},\ell_{2},\ell_{4},\ell_{6})\) where (9) becomes 0+4≤3+3, and \((\ell_{3},\ell_{1},\ell_{6},\ell_{4})\) where it becomes 2+4≤3+3. On the right, (b) shows a totally different h-metric that is also h-Potts with \(w_{i} < w_{\pi(i)}\)

For a fixed tree structure, the relationship between h-Potts, h-metrics, tree metrics and r-hst metrics is shown by the set inclusion diagram below (see Appendix A for proof). Our h-fusion algorithm will be applicable to the shaded cases.

[Diagram omitted: set-inclusion relationships among h-Potts potentials, tree metrics, r-hst metrics, and h-metrics (see Appendix A).]

4.2 Hierarchical Label Costs (h-Subsets)

We have already defined a notion of ‘hierarchical’ smoothness costs (h-metrics), and we now do the same for label costs. As we shall see, if an energy E(f) has hierarchical costs with respect to some tree, then E(f) can be minimized by our h-fusion algorithm on that tree (Sect. 5).

Definition 4

Given an irreducible tree structure π, we define its hierarchical label subsets (h-subsets) as

$$ \mathcal{H}= \{ L \subseteq\mathcal{L}: L \cap\mathcal{L}_i = \emptyset\vee L \subset\mathcal{L}_i \vee L \supseteq \mathcal{L}_i\ \forall i \in\mathcal{T}\} $$

or, equivalently

$$ \mathcal{H}= \biggl\{ L \subseteq\mathcal{L}: L=\bigcup_{i \in I} \mathcal{L}_i \text{ for some siblings } I \subset\mathcal{I}(j) \biggr\}. $$

For example, assume we have \(\mathcal{L}=\{\ell_{1},\ldots,\ell_{6}\}\) grouped in the structure shown below:

[Tree omitted: root r with three children 1, 2, 3 grouping \(\{\ell_{1},\ell_{2}\}\), \(\{\ell_{3},\ell_{4}\}\), and \(\{\ell_{5},\ell_{6}\}\) respectively.]

The hierarchical label subsets are

$$ \mathcal{H}= \bigl\{ \{\ell_1\},\ldots,\{\ell_6\},\ \{\ell_1,\ell_2\},\ \{\ell_3,\ell_4\},\ \{\ell_5,\ell_6\},\ \{\ell_1,\ell_2,\ell_3,\ell_4\},\ \{\ell_1,\ell_2,\ell_5,\ell_6\},\ \{\ell_3,\ell_4,\ell_5,\ell_6\},\ \mathcal{L} \bigr\}. $$

Note that sets like \(\{\ell_{2},\ell_{3}\}\) and \(\{\ell_{1},\ell_{2},\ell_{3}\}\) are not in \(\mathcal{H}\) because they cannot be generated from a union of siblings in the particular hierarchy chosen.
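Equivalently, the h-subsets can be enumerated as unions of sibling subtrees; a minimal sketch, assuming hypothetical `children` and `subtree_leaves` helpers:

```python
from itertools import combinations

def h_subsets(internal_nodes, children, subtree_leaves):
    """Enumerate the h-subsets of Definition 4: every union of the
    leaf-sets of siblings I taken from I(j), over all internal nodes j."""
    H = set()
    for j in internal_nodes:
        kids = children[j]
        for k in range(1, len(kids) + 1):
            for sibs in combinations(kids, k):
                H.add(frozenset().union(*(subtree_leaves(i) for i in sibs)))
    return H
```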

Definition 5

Given a tree π we say that (H,π) form hierarchical label costs if \(H(L)>0 \Rightarrow L \in\mathcal{H}\), i.e. if label costs appear only on the h-subsets.

Note that for a flat tree, the set \(\mathcal{H}= 2^{\mathcal{L}}\) and so all subsets are considered ‘hierarchical’ in this degenerate case.

5 Hierarchical Fusion Algorithm (h-Fusion)

Recall that the α-expansion algorithm (Sect. 3, Boykov et al. 2001) minimizes a multi-label energy by constructing a sequence of binary energies. Our h-fusion algorithm constructs a hierarchical sequence of multi-label energies, each of which is solved by running α-expansion as a subroutine. These intermediate multi-label energies are designed to ‘stitch’ or to ‘fuse’ labelings that were computed earlier in the sequence. As we shall see, this procedure provides better optimality guarantees than α-expansion for a wide class of energies, particularly those with strong label costs.

We name our algorithm h-fusion after the binary fusion moves of Lempitsky et al. (2010). They propose an iterative algorithm to minimize energies of the form E=D+V. Given two candidate labelings \(\hat{f}^{A}\) and \(\hat{f}^{B}\), they try to find a lower-energy labeling by ‘fusing’ the best parts of each. This is highly analogous to optimized crossover operations (Aggarwal et al. 1997), a successful technique in genetic algorithms (discussed in Sect. 8). Figure 4 illustrates how new labelings are generated this way in our hierarchy. The key insight of Lempitsky et al. (2010) is that all fusion moves can be generated by a binary vector \(\mathbf{x}=(x_{p})_{p \in\mathcal{P}}\) where the move itself is \(f = \mathbf{x}\hat{f}^{A} + \bar{\mathbf{x}} \hat{f}^{B}\). Note the similarity with the inner loop of α-expansion: an expansion move from current labeling \(\hat{f}\) can now be thought of as the binary fusion \(f = \mathbf{x}\boldsymbol{\alpha}+ \bar{\mathbf{x}} \hat{f}\) with constant labeling α=(α,…,α). They propose a tennis-tournament strategy for selecting pairs of labelings to fuse, and they stop after the energy fails to decrease for some number of attempts. However, they cannot always find an optimal fusion because they fuse arbitrary labelings; the resulting binary energy is non-submodular and therefore NP-hard in general (see Sect. 3.2). They must rely on minimization methods like QPBO (Kolmogorov and Rother 2007; Rother et al. 2007) with no approximation guarantees.

Fig. 4

Given labelings \(\hat{f}^{A}\) and \(\hat{f}^{B}\), the set of all possible fusions \(\mathcal{M}(\hat{f}^{A},\hat{f}^{B})\) includes the two ‘stitched’ labelings shown at bottom. Our move space \(\mathcal{M}(\hat{f}^{1},\ldots,\hat{f}^{k})\) fuses any number of labelings

In contrast, we construct multi-label subproblems and approximately minimize each one with α-expansion. Ours is therefore a graph cut approach that does not rely on QPBO, message passing (Kolmogorov 2006), or other LP relaxations (Werner 2008). Furthermore we,

  • handle energies with label costs (E=D+V+H),

  • characterize the subclass of energies (h-metrics, h-subsets) for which h-fusion is applicable, and

  • prove approximation bounds that generalize and improve upon those of α-expansion.

To the best of our knowledge, we are the first to incorporate high-arity costs (label costs) into a fusion-based algorithm.

At each step of our h-fusion algorithm, the main subproblem is to find \(\operatorname*{\arg\hspace{-.1em}\min}_{f \in\mathcal{M}(\hat{f}^{1},\ldots,\hat{f}^{k})} E(f)\) where each \(\hat{f}^{i}\) is a fixed labeling with \(\hat{f}^{i}_{p} \in \mathcal{L}_{i}\) and the set of all possible fusions is

$$ \mathcal{M}\bigl(\hat{f}^1,\ldots,\hat{f}^k\bigr) = \bigl\{ f : f_p \in \bigl\{\hat{f}^1_p,\ldots,\hat{f}^k_p\bigr\} \bigr\}. $$
(10)

Crucially, in our framework no two labelings \(\hat{f}^{i}\) and \(\hat{f}^{i'}\) can contain the same labels, unlike the ‘tennis-tournament’ scheme. Given a tree structure π on label set \(\mathcal{L}\), our algorithm generates labelings in a bottom-up fashion with respect to π. Figure 5 shows the order of sub-problems on a simple tree structure. For each internal node j a multi-label fusion energy is constructed and approximately minimized by α-expansion. The local minima computed at one level of the tree are subsequently fused at the next-higher level. Recursive code is given below, where calling h-Fusion(r) on root node r builds the final labeling in bottom-up manner.

Fig. 5

The h-fusion steps on a tree. Subproblems 1, 2, 3 independently generate labelings using mutually exclusive labels. Subsequent steps fuse these labelings in a bottom-up manner

Algorithm 2

h -Fusion(j)—for node j, outputs labeling \(\hat{f}^{j}\) with \(\hat{f}^{j}_{p} \in \mathcal{L}_{j}\)
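Since the pseudocode figure for Algorithm 2 is not reproduced above, the recursion it describes looks roughly as follows; `construct_fusion_energy` and `minimize_with_expansion` are hypothetical stand-ins for the subroutines discussed in the text.

```python
def h_fusion(j, children, P):
    """Return a labeling f^j with f^j_p in L_j (sketch of Algorithm 2)."""
    kids = children.get(j, [])
    if not kids:                            # j is a leaf: constant labeling f_p = j
        return {p: j for p in P}
    # Recursively solve each child subproblem (mutually exclusive labels).
    f_hat = {i: h_fusion(i, children, P) for i in kids}
    # Build the fusion energy E' over child indices and minimize it with
    # the extended alpha-expansion ("line 5" referred to in the text).
    E_prime = construct_fusion_energy(j, f_hat)
    g = minimize_with_expansion(E_prime, kids)
    return {p: f_hat[g[p]][p] for p in P}   # stitch: f(g)_p = f^{g_p}_p
```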

The key question for h-Fusion is how to set up E′ (line 5) so that it encodes our original energy E over all possible fusions, i.e. over all labelings in \(\mathcal{M}(\{ \hat{f}^{i}\}_{i \in\mathcal{I}(j)})\). Given a set of labelings \(\{ \hat{f}^{i} \}_{i \in\mathcal{I}(j)}\) there is a one-to-one correspondence between mappings \(g : \mathcal{P} \rightarrow\mathcal{I}(j)\) and labelings \(f \in \mathcal{M}(\{\hat{f}^{i}\}_{i \in\mathcal{I}(j)})\). We let f(g) be the labeling \(f_{p} = \hat{f}_{p}^{g_{p}}\) corresponding to g. We can then design an unconstrained energy E′ such that E′(g)=E(f(g)) for all g.

Again, we define E′ to be an energy over child indices. It will involve the familiar terms

$$ E'(g) = D'(g) + V'(g) + H'(g) $$
(11)

where D′ takes the usual form and V′ takes the form

$$ V'(g) = \sum_{pq \in\mathcal{N}}V'_{pq}(g_p,g_q). $$

As we shall see, the label costs H′ of our fusion energy must take the form of local label costs (Delong et al. 2012)

$$ H'(g) = \sum_{I} H_{P_I}'(I)\, \delta_{I}(g_{P_I}) $$

where cost \(H_{P_{I}}'(I)\) is only applied for a particular subset of variables \(P_{I} \subseteq\mathcal{P}\) associated with child indices \(I \subseteq\mathcal{I}(j)\), i.e. \(\delta_{I}(g_{P_I})=1\) if and only if \(g_{p} \in I\) for some \(p \in P_{I}\). In other words, we need E′ to encode costs of the form “pay if g uses this child here.” The precise costs encoded by D′,V′ and H′ are shown in ConstructFusionEnergy below.

Algorithm 3

ConstructFusionEnergy(j)
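Algorithm 3's figure is likewise not reproduced; based on the description that follows (lines 1–2 build D′ and V′, lines 4–6 localize each label cost), a plausible sketch with hypothetical helper names is:

```python
def construct_fusion_energy(j, f_hat, D, V, N, H, subtree_leaves, P):
    """Build E'(g) = D'(g) + V'(g) + H'(g) over child indices g_p in I(j)."""
    kids = list(f_hat)
    # Lines 1-2: data and smoothness terms over child indices.
    D_prime = {p: {i: D[p][f_hat[i][p]] for i in kids} for p in P}
    V_prime = {(p, q): {(i, k): V(f_hat[i][p], f_hat[k][q])
                        for i in kids for k in kids}
               for (p, q) in N}
    # Lines 4-6: each label cost H(L) becomes a local label cost H'_{P_I}(I).
    local_label_costs = []
    for L, cost in H.items():
        I = [i for i in kids if subtree_leaves(i) & L]          # line 4
        P_I = {p for p in P for i in I if f_hat[i][p] in L}     # line 5
        if I:
            local_label_costs.append((tuple(I), frozenset(P_I), cost))  # line 6
    return D_prime, V_prime, local_label_costs
```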

The correctness of D′ and V′ is self-evident; the algorithm of Kumar and Koller (2009) includes lines 1–2 but on a more restrictive class of metrics. However, our work was mainly motivated by label costs. It is not obvious how lines 4–6 encode original label cost H(L) as a local label cost \(H_{P_{I}}'(I)\). We now verify its correctness, first by simple example and then by proving it in general.

Example 1

Suppose E′ must fuse the two child labelings \(\hat{f}^{1}\) and \(\hat{f}^{2}\) shown below.

[Labelings omitted: child labeling \(\hat{f}^{1}\) uses labels from \(\{\ell_{1},\ell_{2}\}\) and child labeling \(\hat{f}^{2}\) uses labels from \(\{\ell_{3},\ell_{4}\}\).]

Assume \(\mathcal{H}= \{\{\ell_{1}\},\{\ell_{2}\},\{\ell_{3}\},\{\ell_{4}\},\{\ell_{1},\ell_{2}\},\{\ell_{3},\ell_{4}\} \}\) and H(L)>0 for each \(L \in\mathcal{H}\). Each \(g_{p} \in\{1,2\}\) and so E′(g) can account for H(L) by setting

$$ \begin{array}{l@{\qquad}l} H_{\mathcal{P}}'(1) := H(\{\ell_1,\ell_2\}) & H_{P_1}'(1) := H(\{\ell_1\}),\ P_1 = \{p : \hat{f}^1_p = \ell_1\} \\[3pt] H_{\mathcal{P}}'(2) := H(\{\ell_3,\ell_4\}) & H_{P_2}'(1) := H(\{\ell_2\}),\ P_2 = \{p : \hat{f}^1_p = \ell_2\} \\[3pt] & H_{P_3}'(2) := H(\{\ell_3\}),\ P_3 = \{p : \hat{f}^2_p = \ell_3\} \\[3pt] & H_{P_4}'(2) := H(\{\ell_4\}),\ P_4 = \{p : \hat{f}^2_p = \ell_4\} \end{array} $$

Costs \(H_{\mathcal{P}}'(i)\) (left-hand column) are global label costs in the fusion energy. The other costs \(H_{P}'(i)\) (right-hand column) are localized label costs that encode the original (global) per-label costs \(H(\ell)\). In this example, the fusion below at left should pay all label costs except \(H(\{\ell_{3}\})\), whereas the fusion at right should pay all costs except \(H(\{\ell_{2}\})\).

[Fusions omitted: two example fused labelings; the left one uses no label from \(\{\ell_{3}\}\), the right one uses no label from \(\{\ell_{2}\}\).]

Theorem 3

If energy E has hierarchical label costs (H,π) then E′(g)=E(f(g)) for all \(g : \mathcal{P} \rightarrow\mathcal{I}(j)\).

Proof

The correctness of D′ and V′ is self-evident so we focus on proving that \(H_{P_{I}}'(I)\) correctly encodes H(L) for some subset of labels \(L \in\mathcal{H}\). This reduces to showing that δ L (f(g))=δ I (g P ) where indices I and pixels \(P=P_{I}\) are as defined on lines 4 and 5 of ConstructFusionEnergy. In other words, we must prove that

$$ \exists p \in\mathcal{P}: \hat{f}^{g_p}_p \in L \quad\Longleftrightarrow\quad \exists p \in P : g_p \in I $$
(12)

where we can assume \(\hat{f}^{g_{p}}_{p} \in\mathcal{L}_{g_{p}}\) due to the way h-fusion works.

Because we assume hierarchical label costs, each \(L \in\mathcal{H}\) belongs to one of four cases derived from Definition 4.

1. If \(L \cap\mathcal{L}_{i} = \emptyset\) for all \(i \in \mathcal{I}(j)\), then any \(\hat{f}^{g_{p}}_{p} \notin L\), so the cost H(L) is not applied in subtree j. By definition I=∅, ensuring \(\delta_{L}(f(g))=\delta_{\emptyset}(g_{P})=0\), which is correct.

2. If \(L \subset\mathcal{L}_{i}\) for some \(i \in\mathcal{I}(j)\), then by definition I={i} and \(P = \{ p : \hat{f}^{i}_{p} \in L \}\). Since \(g_{p} \neq i \Rightarrow\hat{f}^{g_{p}}_{p} \notin L\), the labeling f(g) contains a label in L if and only if \(g_{p}=i\) for some p∈P. Therefore \(\delta_{L}(f(g))=\delta_{i}(g_{P})\) holds in this case.

3. If \(\mathcal{L}_{i} \subseteq L \subset\mathcal{L}_{j}\) for some \(i \in\mathcal{I}(j)\), then clearly \(P = \mathcal{P}\). Since \(L \in\mathcal{H}\) we must also have \(L = \bigcup_{i \in I} \mathcal{L}_{i}\), and so f(g) uses a label in L if and only if g uses a label in I. Therefore \(\delta_{L}(f(g))=\delta_{I}(g)\) holds in this case.

4. If \(L \supseteq\mathcal{L}_{j}\) then \(I = \mathcal{I}(j)\) and \(P=\mathcal{P}\), so H(L) can be added to E′ as a constant or simply ignored.  □

Looking at the proof of Theorem 3 we can see that the structure of h-subsets is especially needed for the third case to hold. If we allow a subset \(L \notin\mathcal{H}\), then α-expansion could not be applied to the resulting E′ because the internal binary steps would be non-submodular and potentially NP-hard. The purpose of Example 2 is to demonstrate why H(L)>0 for arbitrary L can be problematic.

Example 2

Suppose E′ must fuse the two child labelings \(\hat{f}^{1}\) and \(\hat{f}^{2}\) shown below.

[Labelings omitted: child labeling \(\hat{f}^{1}\) takes label \(\ell_{1}\) on region A, and \(\hat{f}^{2}\) takes label \(\ell_{4}\) on region B.]

Further suppose that \(H(\{\ell_{1},\ell_{4}\})>0\) in the original energy. This cost should be paid in E′(g) if and only if any \(g_{p}\) variable in region A is assigned child index 1 or any \(g_{p}\) in region B is assigned child index 2. However, this potential cannot be encoded as a label cost of any sort. Furthermore, encoding \(H(\{\ell_{1},\ell_{4}\})\) would result in a non-submodular fusion energy. Let E′(i,j) denote the label cost of assigning index i to pixels A and index j to pixels B, then

$$ \begin{array}{l@{\qquad}l} E'(1,1) = H(\{\ell_1,\ell_4\}) & E'(1,2) = H(\{\ell_1,\ell_4\}) \\[3pt] E'(2,1) = 0 & E'(2,2) = H(\{\ell_1,\ell_4\}). \end{array} $$

That E′(i,j) is not submodular follows from inequality (4) if one assumes |A|=|B|=1, and so α-expansion is not applicable inside h-fusion. In fact, because this example has only two labels, we can conclude that the αβ-swap algorithm (Boykov et al. 2001) is also inapplicable to E′.

Finally, we establish that h-metrics give a precise characterization of smoothness costs V that h-fusion can handle.

Theorem 4

The h-fusion algorithm is applicable to V using tree π if and only if (V,π) forms an h-metric.

Proof

Recall from (6) that α-expansion is applicable if and only if V satisfies, for all \(\alpha,\beta,\gamma\in\mathcal{L}\),

$$ V(\alpha,\alpha) + V (\beta,\gamma) \leq V(\alpha,\gamma) + V(\beta,\alpha). $$
(13)

Actually, if (13) holds for \(\beta,\gamma\in \mathcal{L}\setminus\{\alpha\}\) then it trivially holds for all \(\beta,\gamma\in\mathcal{L}\). This observation matters for h-fusion.

In the h-fusion case, each local fusion metric \(V'_{pq}\) on line 2 of ConstructFusionEnergy must satisfy this constraint and so α-expansion can fuse a collection of labelings \(\{\hat{f}^{i}\}_{i \in\mathcal{I}(j)}\) if and only if, for all \(i \in\mathcal{I}(j), \; i',i'' \in\mathcal{I}(j)\setminus\{i\}\),

$$ V\bigl(\hat{f}^i_p, \hat{f}^{i}_q\bigr) + V\bigl( \hat{f}^{i'}_p,\hat{f}^{i''}_q \bigr) \leq V\bigl(\hat{f}^i_p, \hat{f}^{i''}_q\bigr) + V\bigl( \hat{f}^{i'}_p,\hat{f}^{i}_q \bigr). $$
(14)

Note that \(\hat{f}^{i}_{p}\) and \(\hat{f}^{i}_{q}\) could each be any label in \(\mathcal{L}_{i}\) and are not necessarily identical, unlike the α-expansion case. The constraints on the original metric V for h-fusion will therefore be more restrictive than for α-expansion. Since inequality (14) must hold for all possible labelings \(\hat{f}^{i},\hat{f}^{i'}\), and \(\hat{f} ^{i''}\) then it is equivalent to

$$ V(\alpha_1,\alpha_2) + V(\beta,\gamma) \leq V(\alpha_1,\gamma) + V(\beta,\alpha_2) \quad \text{for all } \alpha_1,\alpha_2 \in \mathcal{L}_i \text{ and } \beta,\gamma \in \mathcal{L}_j \setminus \mathcal{L}_i. $$
(15)

Since j=π(i), inequalities (15) are identical to (9). □

6 Approximation Bounds of h-Fusion

Since α-expansion has an approximation bound and we use it as our main subroutine, it is natural to ask if h-fusion has some bound of its own. If we use a flat tree π and assume E=D+V, then h-fusion reduces to classical α-expansion and we directly inherit the 2c-approximation where

$$ c = \frac{\max_{\alpha,\beta\in\mathcal{L}} V(\alpha,\beta)}{\min_{\gamma\neq\zeta\in\mathcal{L}} V(\gamma,\zeta)}. $$
(16)

Our goal is to derive a generalized bound for h-fusion with arbitrary tree π and arbitrary label costs (i.e. \(\mathcal{H}= 2^{\mathcal{L}}\) in (1)). Like the α-expansion bound in Theorem 2, the quality of our new bound will involve some c and c 2 that depend on the particular energy. As we shall see, these two coefficients can be much smaller for our algorithm. We begin by defining some useful quantities for expressing the h-fusion bound.

Definition 6

Given smoothness costs V and a particular tree π, we define two quantities for each node i:

$$ V^{\max}_i = \max_{\alpha,\beta \in \mathcal{L}_i} V(\alpha,\beta), \qquad V^{\min}_i = \min_{\substack{i' \neq i'' \in \mathcal{I}(i)\\ \alpha \in \mathcal{L}_{i'},\ \beta \in \mathcal{L}_{i''}}} V(\alpha,\beta). $$

In other words, \(V^{\max}_{i}\) is the maximum cost for any pair of labels in the subtree of node i, and \(V^{\min}_{i}\) is the minimum cost for two labels from different subtrees descended from i. For example, in Fig. 3a we have \(V^{\max}_{A} = 2, V^{\min}_{A}=1\) and \(V^{\max}_{B}=4, V^{\min}_{B}=2\). For the root node r, this example has \(V^{\max}_{r}=4\) and \(V^{\min}_{r} = 3\).

Definition 7

Given an h-metric (V,π) where V is also semi-metric, we define the additional quantities

$$ c_j = \frac{V^{\max}_j}{V^{\min}_j} \ \text{ for each } j \in \mathcal{S}\cup\{r\}, \qquad c = \max_{j \in \mathcal{S}\cup\{r\}} c_j. $$

Observation 3

If π defines a flat tree, then c in Definition 7 reduces to quantity (16) from the α-expansion bound.

The ratio c is most important because it bounds the worst-case approximation error. As we will show, when c is large for standard α-expansion, choosing a non-flat tree can result in a much smaller constant for h-fusion and thereby a better bound. The easiest way to understand how V and π affect c is by looking at a few numeric examples. Figure 6 examines specific values of c for various pairs of smoothness cost matrices V and trees π. These examples suggest that for each choice of V there exists an optimal choice of π to give the best approximation bound. Since for every metric V we can use a flat tree, we can always find a tree for which h-fusion's bound is at least as good as α-expansion's.
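Definitions 6 and 7 translate directly into code; a minimal sketch (assuming hypothetical `children` and `subtree_leaves` helpers) that computes c for a given tree:

```python
from itertools import combinations

def bound_coefficient(V, internal_nodes, children, subtree_leaves):
    """Compute c = max_j Vmax_j / Vmin_j over internal nodes j (Definition 7)."""
    c = 1.0
    for j in internal_nodes:
        leaves = subtree_leaves(j)
        vmax = max(V(a, b) for a in leaves for b in leaves)   # Vmax_j
        vmin = min(V(a, b)                                    # Vmin_j
                   for i1, i2 in combinations(children[j], 2)
                   for a in subtree_leaves(i1)
                   for b in subtree_leaves(i2))
        c = max(c, vmax / vmin)
    return c
```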

Fig. 6

The top row shows three example smoothness cost matrices. The first is a standard (flat) Potts potential with penalty \(V_{1}(\ell,\ell')=3\) for any \(\ell \neq \ell'\) and so c=1 for a flat tree \(\pi_{a}\). Metrics \(V_{2}\) and \(V_{3}\) have varying penalties and so a flat tree yields c=3 and c=8 respectively. However, by applying h-fusion on tree structures \(\pi_{b}\) and \(\pi_{c}\) respectively (bottom row), we can achieve better c for these particular metrics. The table at right shows other values of c and demonstrates that the choice of tree is important for achieving a good bound

We now consider label costs in h-fusion and generalize the related coefficient \(c_{2}\). The cardinality of set \(I \subset\mathcal{I}(j)\) on line 4 of ConstructFusionEnergy is an important quantity affecting \(c_{2}\) for h-fusion. In general, the smaller |I| the better the bound. (Note that we use ⊂ to mean ⊊ throughout this paper.)

Definition 8

We define \(c_{2}\) to be the maximum cardinality of any index set I (ConstructFusionEnergy, line 4) that is a strict subset of \(\mathcal{I}(j)\), minus 1. That is, the constant

$$ c_2 = \max_{\substack{L : H(L)>0 \\ I(L) \subset \mathcal{I}(j)}} \bigl( |I(L)| - 1 \bigr) $$

where \(I(L) \subset\mathcal{I}(j)\) for some j and \(\bigcup_{i \in I(L)} \mathcal{L}_{i} = L\).

The fact that I(L) will always be the union of some siblings in the tree follows from the assumption that L is an h-subset.

Observation 4

If π defines a flat tree, then \(c_{2}\) in Definition 8 reduces to the same quantity for α-expansion in Theorem 2.

To see how h-fusion can beat α-expansion at minimizing energies with label costs, consider the following worst-case example for α-expansion.

Example 3

Suppose variables \(\mathcal{P}= \{p_{1},\ldots,p_{n}\}\) and labels \(\mathcal{L}=\{\ell_{1},\ldots,\ell_{n+1}\}\). The data costs \(D_{p}(\cdot)\) are shown in the table below, where a>0, and smoothness costs are zero. We also assume a label cost \(H(\{\ell_{1},\ldots,\ell_{n}\})=a\), as illustrated.

[Table and tree omitted: each \(p_{i}\) has data cost 0 for \(\ell_{i}\), cost a for \(\ell_{n+1}\), and prohibitively high cost for all other labels; the tree groups \(\ell_{1},\ldots,\ell_{n}\) under one pseudo-label, with \(\ell_{n+1}\) as the root's other child.]

The labeling \(\hat{f}=(\ell_{n+1},\ldots,\ell_{n+1})\) is a local minimum for α-expansion because no individual variable wants to pay the (shared) label cost in order to switch its label. However the globally optimal labeling is \(f^{*} =(\ell_{1},\ldots,\ell_{n})\) and, since \(E(\hat{f}) = n E(f^{*})\), the α-expansion solution can be made arbitrarily bad. The h-fusion algorithm will find \(f^{*}\) if we use the tree shown above, at right. Notice that \(c_{2}=0\) for this tree, whereas \(c_{2}=n-1\) for a flat tree (α-expansion).

For an example of non-trivial \(c_{2}\), consider again the six-label tree structure:

[Tree omitted: root with children 1, 2, 3 grouping \(\{\ell_{1},\ell_{2}\}\), \(\{\ell_{3},\ell_{4}\}\), and \(\{\ell_{5},\ell_{6}\}\).]

If the only label cost H(L)>0 is on subset \(L=\{\ell_{5},\ell_{6}\}\) then I(L)={3} and so \(c_{2}=|I(L)|-1=0\). If instead we have a label cost on \(L=\{\ell_{1},\ell_{2},\ell_{5},\ell_{6}\}\) then I(L)={1,3} yielding coefficient \(c_{2}=1\). Again, the bound of h-fusion is stronger for energies where \(c_{2}\) is small.

Using our definitions of c and \(c_{2}\) we state the main theorem of this work: an improvement upon the bound of Delong et al. (2012). For the purposes of the bound we assume \(D_{p}(\cdot) \geq 0\) and that V is semi-metric.

Theorem 5

If \(f^{*}\) is a global minimum of E=D+V+H with h-metric (V,π) and hierarchical label costs (H,π), then the solution \(\hat{f}\) computed by h-fusion is bounded by

$$ E(\hat{f}) \leq(2 c +c_{2})E\bigl( f^{*}\bigr) + \sum_{L \in\mathcal{H}} H(L) $$
(17)

where c and \(c_{2}\) are constants given in Definitions 7 and 8 respectively (possibly much smaller than in Theorem 2).

Proof

See Appendix B. □

In the presence of arbitrary label costs, this is still not a constant-ratio approximation bound, but we can construct a worst-case example to show that our bound is indeed tight (see Delong 2011).

7 Application: Hierarchical Color Segmentation

We use hierarchical color segmentation as a simple, illustrative example because it allows us to visualize the effects of hierarchical smoothness and label costs. Given an image we wish to group pixels with similar color. We treat segmentation as labeling where each label represents a color; the labels essentially re-colorize the image. However, we explicitly divide the possible colors into groups, and seek a pixel labeling that relies only on a few groups of colors. For example, a natural way to group colors is by hue, and the goal is then to re-color the image using as few hues as possible while staying reasonably faithful to the original image.

The tree below illustrates one possible grouping (hierarchy) of color labels. Each leaf corresponds to a specific color label (e.g. dark red, red, bright red, …) while each subtree corresponds to a group of labels (e.g. reds, blues, greens, grays, …)

[Tree omitted: the color-label hierarchy; each leaf is a specific color label (e.g. dark red, red, bright red, …) and each subtree is a group of labels (reds, blues, greens, grays, …).]

In order to limit the number of hues used in re-colorization we introduce group costs in addition to regular label costs. A cost is associated with each group of colors L and is represented by a label subset cost H(L)>0. It is paid whenever any of the colors in the group is used in the labeling. For the smoothness costs V we use hierarchical Potts smoothness terms between and within color groups to encourage smooth re-colorization. If \(I_{p}\) is an image pixel, the data cost \(D_{p}(\ell)\) is proportional to the squared distance between \(I_{p}\) and the color represented by label \(\ell\).
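A minimal sketch of this data term; the proportionality constant `lam` is an assumption, not a value from the experiments.

```python
import numpy as np

def recolor_data_costs(image, label_colors, lam=1.0):
    """image: (H, W, 3) float array; label_colors: (K, 3) array with one RGB
    color per label. Returns (H, W, K) data costs D_p(l) proportional to the
    squared distance between pixel I_p and the color of label l."""
    diff = image[:, :, None, :] - label_colors[None, None, :, :]
    return lam * np.sum(diff ** 2, axis=-1)
```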

We thereby formulate the re-colorization problem as a hierarchical energy. We compare α-expansion and h-fusion when applied to this energy. In all the re-colorization experiments our color hierarchy consists of 121 groups of colors, each containing 20 different shades varying from dark to bright. This results in 2420 labels in total. We then demonstrate qualitative (the resulting re-colorizations) and quantitative (running time and energy value) comparisons. In all experiments we set \(w_{i}=1\) and \(w_{r}=2\). Each invocation of α-expansion performed two cycles only (a cycle expands on each label exactly once). This limitation was applied to all instances of α-expansion within h-fusion as well. (Allowing α-expansion to converge takes much longer but only decreases the energy by <0.01 % for both algorithms.)

Consider the input image shown in Fig. 7 top. It has a smooth gradient of color varying from black to yellow through the shades of green and red. Both α-expansion and h-fusion algorithms result in re-colorizations with 4 groups of colors each, namely shades of green, yellow, orange and red (see bottom left and right). However, it can be seen from the result of α-expansion (bottom right) that the optimization got stuck in a local minimum. By expanding on a wrong group of colors first (wrong hue), α-expansion was unable to match the bright portion of the image well. At this point expanding on any one label of a better matched hue did not justify adding the extra group costs associated with this new hue. In the case of h-fusion the algorithm is able to replace one group of hues with another at once and therefore attain a lower energy (see Fig. 7 bottom-left).

Fig. 7

Synthetic example. Qualitative and quantitative comparison between α-expansion and h-fusion. The blue line corresponds to the energy attained by α-expansion as a function of time. The pink line with diamonds corresponds to the energies of child-labelings optimized sequentially in the first step of h-fusion. Child-labelings are then stitched together in a fusion step. The energy of the final fusion is shown by the red line. The green line represents the energy of h-fusion if all child-labelings are computed in parallel (Color figure online)

The plot in Fig. 7 provides a quantitative comparison between α-expansion and h-fusion in terms of running time and energy values. The blue line corresponds to the energy value attained by α-expansion as a function of time. The h-fusion algorithm begins by optimizing a set of sub-energies corresponding to child-labelings. Each child-labeling is restricted to one sub-tree of labels and essentially re-colors the image with the colors from that group only. (For example one child-labeling re-colors the image with the shades of red, another with the shades of green, …) The sub-energies are independent and can be optimized either sequentially or in parallel. When sub-energies are optimized sequentially we represent each sub-energy with a pink diamond and plot them as a function of cumulative time. After all sub-energies are optimized, the h-fusion algorithm fuses the resulting child-labelings by running α-expansion (starting from the child-labeling with the minimal energy; again we limit h-fusion to two cycles only). The energy of h-fusion is represented by the red line and attains a lower energy than regular α-expansion.

Unlike α-expansion, the running time of h-fusion can be dramatically improved by minimizing sub-energies in parallel. This is illustrated by the green line in the plot of Fig. 7. In our specific application the parallel version of h-fusion is faster by a factor of 10–15 compared to sequential h-fusion. At any time in our experiments, the energy curve of parallel h-fusion is dramatically below that of α-expansion, and terminates 20–30 times faster. In theory this speed-up factor should grow linearly with the number of siblings at each level of the label hierarchy. The speed-up is due to the fact that running one expansion cycle with h-fusion is more efficient than with regular α-expansion. This is because the number of unique possible labels in h-fusion corresponds to the number of groups in the hierarchy (121 in our case) while the number of unique labels for α-expansion corresponds to the number of leaves in the hierarchy (2420 colors in our case).

Figure 9 shows similar results for additional input images shown in Fig. 8. For all the experiments sequential h-fusion (pink-red line) attained lower energy and in shorter time than α-expansion (blue line), with even more significant speedup in the case of parallel optimization of the sub-energies in h-fusion (green line).

Fig. 8

Input images used for the experiments shown in Fig. 9

Fig. 9

Qualitative and quantitative comparison between α-expansion and h-fusion on input images shown in Fig. 8. Again, blue, red-pink and green lines correspond to α-expansion, sequential h-fusion and parallel h-fusion respectively. Note that the re-colorizations obtained with h-fusion are more faithful to the original images than those obtained with α-expansion. For example the Lego piece in the image of a girl is much brighter and, in the parrot image, the parrot’s chest and the tree branch are colored more faithfully (Color figure online)

8 Discussion

The main results of this paper are a characterization of hierarchical costs (h-metrics and h-subsets), the h-fusion algorithm itself, and a significant improvement on the approximation bound of α-expansion. These results are theoretical, but we foresee a number of applications for such energies.

Applications of Hierarchical Costs

We presented hierarchical color segmentation as the simplest possible example that illustrates (a) the nature of energies with hierarchical costs, and (b) the qualitative and quantitative benefits of h-fusion for such energies. However, computer vision is full of problems for which hierarchical costs are natural.

The most obvious is using hierarchical context (e.g. Choi et al. 2010) for image segmentation, where in theory we could group the labels into some appropriate context as depicted below.

[Diagram omitted: object-class labels grouped by context, e.g. {car, road, sky} under an “outdoors” group and {table, chair, wall} under an “indoors” group.]

This is a very rudimentary form of context but can be integrated with segmentation via an energy with hierarchical V and H terms.

In vision it is also common to assign labels that have geometric meaning, such as depths (e.g. Boykov et al. 2001; Ladický et al. 2010b), homographies or motions (e.g. Birchfield and Tomasi 1999; Isack and Boykov 2012). For example, Isack and Boykov (2012) start with a set of observations (points, matches, etc.) and use random sampling to generate hundreds of candidate geometric models, much the way ransac does (Fischler and Bolles 1981). They formulate the model fitting problem as a labeling problem where each label represents a candidate model. They find a labeling that corresponds to a good configuration of models, and do this by minimizing an energy of the form E=D+V+H. However, there are many situations where geometric models fall into a natural hierarchy. Figure 10 is a hypothetical example to illustrate this point. Analogous hierarchical relationships exist between, for example, a fundamental matrix (a rigid motion) and the family of homographies (families of correspondences) compatible with that fundamental matrix (Hartley and Zisserman 2003).

Fig. 10

Depiction of how hierarchical line fitting might work with an energy of the form E=D+H. Each label \(\ell\) corresponds to a possible line (e.g. from random sampling), and each point wants to be labeled by a nearby line. Label cost \(H(\ell)\) discourages line \(\ell\) from being used unless there are enough supporting points—otherwise the points take the outlier label (constant penalty per point). However, if we group lines by orientation, we could add costs H(L) where L is a family of lines, encouraging solutions that use a few families of parallel lines

Furthermore, hierarchical costs can be useful for detecting patterns, for compression, and for learning a database of inter-dependent patches from images (Gorelick et al. 2011). Outside vision, Sefer and Kingsford (2011) showed that the r-hst metrics of Kumar and Koller are effective at identifying protein function; our work could extend their results.

Relation to r-hst Metrics

Recall that, at a high level, the h-fusion process shown in Fig. 5 is the same as that used by Kumar and Koller (2009). Given a metric V, they find the set of r-hst metrics that best approximates V and try to minimize an energy of the form E=D+V using a bottom-up fusion process. The main idea of an r-hst metric is as follows. Assume we are given a tree with distances d(i,j) defined on each edge from child i to parent j. Assume that the distance from j to all its children is uniform, i.e. d(i,j)=d(i′,j) for all \(i,i' \in\mathcal{I}(j)\). Further assume that we know the parent-to-child distance gets cheaper by a factor of r as we descend the tree, i.e. \(\frac{d(i,j)}{d(k,i)} \ge r\) for any child k of i and some constant r>1. The total distance between two leaf nodes α and β is the cumulative sum of edge distances along the path from α to β in the tree. If the ‘costs’ of a pairwise potential V(α,β) correspond to such a distance function for all α,β, then V is said to be an r-hst metric.

Our concept of an h-metric is expressed directly in terms of constraints on V(⋅,⋅), not on edges or distances traversed in the tree. Furthermore, r-hst metrics are a strict subset of h-metrics (see Appendix A).

Generalizing Facility Location

In the optimization and operations research communities, uncapacitated facility location (UFL) is a well-studied problem (e.g. Shmoys et al. 1998). UFL assigns a ‘facility’ to serve each client such that the cost to clients and the cost of opening facilities is jointly minimized. UFL is connected to our energy because if we let \(\mathcal{L}\) denote the facilities and \(\mathcal{P}\) denote the clients then every problem instance can be expressed as minimizing an energy of the form

$$ E(f) = \sum_{p \in\mathcal{P}} D_p(f_p) + \sum_{\ell\in\mathcal{L}} H( \ell)\delta_\ell(f). $$
(18)
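Concretely, a UFL instance maps onto energy (18) as in this minimal sketch; the names are illustrative.

```python
def ufl_energy(f, client_cost, open_cost):
    """Energy (18): f[p] is the facility serving client p, client_cost[p][l]
    is D_p(l), and open_cost[l] is H(l), paid once per facility used."""
    service = sum(client_cost[p][f[p]] for p in range(len(f)))
    opening = sum(open_cost[l] for l in set(f))
    return service + opening
```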

In vision, the UFL objective has recently been applied to motion segmentation by Li (2007) and by Lazic et al. (2009), but goes all the way back to Torr and Murray (1994).

There exist variants of UFL that allow for a hierarchy of facilities, e.g. Svitkina and Tardos (2006) and Sahin and Süral (2007). This generalization allows for more realistic modeling of complex interdependencies between facilities themselves. Some of these works derive constant-factor approximation bounds for hierarchical facility location, e.g. Kantor and Peleg (2009), but all such works assume metric client costs where the costs D p (⋅) are computed as distances from a particular center. Without this assumption, Feige’s hardness result still holds. Strategies for optimizing hierarchical UFL include linear programming relaxation, primal-dual algorithms and, very recently, message passing algorithms (Givoni et al. 2011).

We can encode a kind of hierarchical facility cost with our framework as follows. Suppose facilities \(\ell_{1}\) and \(\ell_{2}\) require the services of facility \(\ell_{3}\), which costs 50 to open. A label cost \(H(\{\ell_{1},\ell_{2},\ell_{3}\}):=50\) correctly accounts for the shared dependency of \(\ell_{1}\) and \(\ell_{2}\) on \(\ell_{3}\). If we furthermore have a facility \(\ell_{4}\) that depends on both \(\ell_{3}\) and some facility \(\ell_{5}\) (cost 80), then our label costs should instead be \(H(\{\ell_{1},\ell_{2},\ell_{3},\ell_{4}\}):=50\) and \(H(\{\ell_{4},\ell_{5}\}):=80\).

Furthermore, our h-fusion algorithm can handle smoothness costs V, which to the best of our knowledge are novel for UFL. In the UFL setting, V(f p ,f q ) can encode an explicit preference that clients p and q be serviced by the same facility. When clients are social, there are many scenarios where such a preference makes sense. When client costs D are metric (e.g. Euclidean distance) then this preference is implicitly encoded in D. However, when the client costs are not metric, such as clients connected by an irregular network despite being physically close, then our smoothness costs V may be useful for modeling such problems.

Improving the Bound

Recall from Observation 1 that for the Set-Cover problem the best we can hope for is a \(\ln|\mathcal{P}|\)-approximation. Yet one can formulate Set-Cover using an energy of the form (18), so minimizing energy E=D+V+H is at least as hard. However, Hochbaum (1982) gave a simple greedy algorithm for Set-Cover and proved that it yields precisely a \(\ln|\mathcal{P}|\)-approximation, the best possible according to Feige (1998). If label costs are arbitrary in (18), then α-expansion's bound is also arbitrarily bad. So, there is a huge gap between what α-expansion can achieve on (18) versus what Hochbaum's greedy algorithm can guarantee. For energies of the form E=D+H, it may be possible to extend Hochbaum's algorithm and use it as a subroutine within h-fusion (rather than using α-expansion). One may ask if h-fusion could inherit a better approximation bound in that case. We also do not know if her approach can be applied in the presence of smoothness costs V.

Relation to Genetic Algorithms

Within our framework, the inner α-expansion subroutine is performing a sequence of fusion moves as proposed by Lempitsky et al. (2010). We point out that a binary fusion move is essentially an optimized crossover operation, already used to some success in genetic algorithms (Aggarwal et al. 1997; Meyers and Orlin 2007). A standard concern for genetic algorithms is how to maintain population diversity so that, when two chromosomes (labelings) are crossed, there is a chance that the descendant will be better. Our h-fusion process forces a kind of population diversity based on a tree: the labelings in our multi-label fusion each contain labels from different subtrees. It is interesting that this structured diversity gives a provably better approximation bound in our case.