1 Introduction

Minimum st cut (mincut) is a classical combinatorial problem with applications in many areas of science and engineering. This research was motivated by the wide use of mincut/maxflow problems in computer vision, where large sparse instances need to be solved. We start by a more detailed overview of models and optimization techniques in vision, where the mincut problem is employed and give examples of our test problems.

1.1 mincut in Computer Vision

In some cases, an applied problem is formulated directly as a mincut. More often, however, mincut problems in computer vision originate from the Energy minimization framework (maximum a posteriori solution in a Markov random field model). Submodular Energy minimization problems completely reduce to mincut (Ishikawa 2003; Schlesinger and Flach 2006). When the energy minimization is intractable, mincut is employed in relaxation and local search methods. The linear relaxation of pairwise Energy minimization with 0-1 variables reduces to mincut (Boros et al. 1991; Kolmogorov and Rother 2007) as well as the relaxation of problems reformulated in 0-1 variables (Kohli et al. 2008). Expansion-move, swap-move (Boykov et al. 1999) and fusion-move (Lempitsky et al. 2010) algorithms formulate a local improvement step as a mincut problem.

Many applications of mincut in computer vision use graphs of a regular structure, with vertices arranged into an N-D grid and edges uniformly repeated, e.g., 3D segmentation models illustrated in Fig. 1(c), 3D reconstruction models, Fig. 1(b). Because of such regular structure, the graph itself need not be stored in the memory, only the edge capacities, allowing relatively large instances to be solved by a specialized implementation. However, in many cases, it is advantageous to have a non-regular structure, e.g., in stereo with occlusions in Fig. 1(a), in 3D reconstruction with adaptive tetrahedral volume (Labatut et al. 2009; Jancosek and Pajdla 2011). Such applications would benefit from a large-scale generic mincut solver.

Fig. 1
figure 1

Examples of labeling problems in computer vision solved via maxflow. (a) Stereo and stereo with occlusions (Boykov et al. 1998; Kolmogorov and Zabih 2001). (b) 3D reconstruction (Lempitsky et al. 2006; Boykov and Lempitsky 2006) and surface fitting (Lempitsky and Boykov 2007). (c) 3D segmentation (Boykov and Jolly 2001; Boykov and Funka-Lea 2006; Boykov and Kolmogorov 2003). The instances are published at the University of Western Ontario web pages (2008) for benchmarking maxflow implementations

1.2 Distributed Computation

The previous research mostly focused on speeding up mincut by parallel computation in the shared memory model. We consider a distributed memory model, which assumes that the computation units have their own separate memory and exchanging the information between them is expensive. A distributed algorithm has therefore to divide the computation and the problem data between the units and keep the communication rate low. We will consider distributed algorithms, operating in the following two practical usage modes:

  • Sequential (or streaming) mode, which uses a single computer with a limited memory and a disk storage, reading, processing and writing back a portion of data at a time.

  • Parallel mode, in which the units are thought as computers in a network.

We propose new algorithms for both cases, prove their correctness and termination guarantees. In the assessment of complexity, we focus on the costly operations such as load-unload of the data in the streaming mode or message exchanges in the parallel mode. More specifically, we call a sweep the event when all units of a distributed algorithm recalculate their data once. Sweeps in our algorithms correspond to outer iterations and their number is roughly proportional to the amount of communication in the parallel mode or disk operations in the streaming mode. While there are algorithms with better bounds in terms of elementary operations, our algorithms achieve lower communication rates.

1.3 Previous Work

A variant of path augmentation algorithm was shown by Boykov and Kolmogorov (2004) to have the best performance on computer vision problems among sequential solvers. There were several proposals how to parallelize it. Partially distributed implementation (Liu and Sun 2010) augments paths within disjoint regions first and then merges regions hierarchically. In the end, it still requires finding augmenting paths in the whole problem. Therefore, it cannot be used to solve a large problem by distributing it over several computers or by using a limited memory and a disk storage. For the shared memory model Liu and Sun (2010) reported a near-linear speed-up with up to 4 CPUs for 2D and 3D segmentation problems.

Strandmark and Kahl (2010) obtained a distributed algorithm using a dual decomposition approach. The subproblems are mincut instances on the parts of the graph (regions) and the master problem is solved using the subgradient method. This approach requires solving mincut subproblems with real valued capacities and does not have a polynomial bound on the number of iterations. The integer algorithm proposed by Strandmark and Kahl (2010) is not guaranteed to terminate. Our experiments (Sect. 7.3) showed that it did not terminate on some of the instances in 1000 sweeps. In Sect. 10, we relate dual variables in this method to flows.

The push-relabel algorithm (Goldberg and Tarjan 1988) performs many local atomic operations, which makes it a good choice for a parallel or distributed implementation. A distributed version (Goldberg 1991) runs in O(n 2) time using O(n) processors and \(O(n^{2}\sqrt{m})\) messages, where n is the number of vertices and m is the number of edges in the problem. However, for a good practical performance it is crucial to implement gap relabel and global relabel heuristics (Cherkassky and Goldberg 1994). The global relabel heuristic can be parallelized (Anderson and Setubal 1995), but it is difficult to distribute. We should note, however, that the global relabel heuristic was not essential in the experiments with computer vision problems we made (Sect. 7.2). Delong and Boykov (2008) proposed a coarser granulation of push-relabel operations, associating a subset of vertices (a region) to each processor. Push and relabel operations inside a region are decoupled from the rest of the graph. This allows to process several non-interacting regions in parallel or run in a limited memory, processing few regions at a time. The gap and relabel heuristics, restricted to the regions (Delong and Boykov 2008) are powerful and distributed at the same time. Our work was largely motivated by Delong and Boykov (2008) and the notice that their approach might be extendible to augmenting path algorithms. However, our first attempt to prioritize augmentation to the boundary vertices by the shortest distance to the sink did not lead to a correct algorithm.

1.4 Contribution

In this work we revisit the algorithm of Delong and Boykov (2008) for the case of a fixed partition into regions. We study a sequential variant and a novel parallel variant of their algorithm, which allows computation on neighboring interacting regions to run concurrently using a conflict resolution similar to the asynchronous parallel push-relabel (Goldberg 1991). We prove that both variants have a tight O(n 2) bound on the number of sweeps. The new algorithm we construct works with the same partition of the graph into regions but is guided by a different distance function than the push-relabel one.

Given a fixed partition into regions, we introduce a distance function which counts the number of region boundaries crossed by a path to the sink. Intuitively, it corresponds to the amount of costly operations—network communications or loads-unloads of the regions in the streaming mode. The algorithm maintains a labeling, which is a lower bound on the distance function. Within a region, we first augment paths to the sink and then paths to the boundary vertices prioritized by the lowest label. Thus the flow is pushed out of the region in the direction given by the distance estimate. We present a sequential and parallel versions of the algorithm which terminate in \(O(|\mathcal {B}|^{2})\) sweeps, where \(\mathcal {B}\) is the set of all boundary vertices (incident to inter-region edges).

The proposed algorithms are evaluated on instances of mincut problems collected and published by the Computer Vision Research Group at the University of Western Ontario (illustrated in Fig. 1). The results are compared against the state-of-the-art sequential and parallel solvers. We also studied the behavior of the algorithms w.r.t. problem size, granularity of the partition, etc. Our implementation is publicly available at http://cmp.felk.cvut.cz/~shekhovt/d_maxflow.

1.5 Other Related Work

The following works do not consider a distributed implementation but are relevant to our design. The Partial Augment-Relabel algorithm (PAR) by Goldberg (2008) augments in each step a path of length k. It may be viewed as a lazy variant of push-relabel, where actual pushes are delayed until it is known that a sequence of k pushes can be executed. The algorithm by Goldberg and Rao (1998) incorporates the notion of a length function and a valid labeling w.r.t. this length. It can be seen that the labeling maintained by our algorithm corresponds to the length function assigning 1 to boundary edges and 0 to intra-region edges. Goldberg and Rao (1998) used such generalized labeling in the context of the blocking flow algorithm but not within the push-relabel.

2 mincut and Push-Relabel

We solve mincut problem by finding a maximum preflow.Footnote 1 In this section, we give basic definitions and introduce the push-relabel framework of Goldberg and Tarjan (1988). While we assume the reader is familiar with mincut/maxflow, we explain some known results using the notation adjusted for the needs of this paper.

In the classical framework of minimum cut and maximum flow, the flow augmentation transforms a minimum cut problem into an equivalent one on the residual network (preserving costs of all cuts up to a constant). However, there is no equivalent minimum cut problem corresponding to an augmentation of a preflow. In the push-relabel approach of Goldberg and Tarjan (1988), this is not essential, as only single residual arcs need to be considered and algorithms can be formulated as working with a pair of a network and a preflow. In this paper, we need to work with residual paths and the reachability in the residual network. We therefore use the extended definition of the minimum cut problem, which includes a flow excess (or supply) in every vertex. After this extension, the family of equivalent min-cut problems becomes closed under preflow augmentations. This allows us to formulate algorithms more conveniently as working with the current residual network and constructing a preflow increment. This point is illustrated in Fig. 2.

Fig. 2
figure 2

(a) Example of a network with indicated edge capacity function. (b) Augmenting path approach: send flow from the source to the sink along a path. The residual network defines an equivalent min-cut problem. (c) Push-relabel approach: the preflow is pushed over arcs in all directions, prioritized by the shortest distance to the sink. The equivalent min-cut problem is defined by a network with excess

By a network we call the tuple G=(V,E,s,t,c,e), where V is a set of vertices; EV×V is the set of edges, thus (V,E) is a directed graph; s,tV, st, are source and sink, respectively; c:E→ℕ0 is a capacity function; and e:V→{0,1,…,∞}, e(t)=0, e(s)=∞ is an excess function. We also denote n=|V| and m=|E|.

For any sets X,YV we will denote (X,Y)=E∩(X×Y). For CV such that sC, tC, the set of edges \((C,\bar{C})\), with \(\bar{C} = V\backslash C\) is called an st cut. The mincut problem is

$$ \min \biggl\{ \sum_{(u,v)\in (C,\bar{C})} c(u,v){+}\sum_{v\in \bar{C} } e(v) \bigm {|}C\subset V, \,s\in C,\,t\in \bar{C} \biggr\}. $$
(1)

The objective is called the cost of the cut. Note, that excess in this problem can be equivalently represented as additional edges from the source, but we prefer the explicit form. Without a loss of generality, we assume that E is symmetric—if not, the missing edges are added and assigned a zero capacity.

A preflow in G is the function f:E→ℤ satisfying the following constraints:

(2a)
(2b)
(2c)

The constraint (2b) removes the redundancy in the otherwise independent flow values on (u,v) and (v,u) (positive flows should naturally cancel each other) and shortens the equations at the same time.

A residual network w.r.t. preflow f is a network G f =(V,E,s,t,c f ,e f ) with the capacity and excess functions given by

(3a)
(3b)

By constraints (2a), (2b), (2c) it is c f ≥0 and e f ≥0. The costs of all st cuts differ in G and G f by a constant called the flow value, |f|=∑ u∣(u,t)∈E f(u,t). This can be easily verified by substituting c f and e f into (1) and expanding. Network G f is thus equivalent to network G up to the constant |f| and since all cuts in G f are non-negative, |f| is a lower bound on the cost of a cut in G. The problem of maximizing this lower bound, i.e. finding a maximum preflow:

$$ \max_{f} |f| \quad \mbox {s.t. }\quad \mbox{constraints~(2a), (2b), (2c)} $$
(4)

is dual to mincut. The value of a maximum preflow equals to the cost of a minimum cut and the solutions are related as explained below.

We say that wV is reachable from vV in the network G f if there is a path (possibly of length 0) from v to w composed of edges with strictly positive residual capacities c f (a residual path). This relation is denoted by vw.

Let us consider a residual path from v to w such e f (v)>0. Augmentation increases the flow by Δ>0 on all forward edges of the path, and decreases it on all reverse edges, where Δ does not exceed the residual capacities of the forward arcs or e f (v). In the result, the excess e f (v) is decreased and excess e f (w) is increased. Augmenting paths to the sink increases the flow value. In the augmenting path approach, the problem (4) is solved by repeatedly augmenting residual paths from vertices having excess (e.g., source) to the sink.

If w is not reachable from v in G f we write vw. For any X,YV, we write XY if there exist xX, yY such that xy. Otherwise we write XY.

A preflow f is maximum iff there is no residual path to the sink which can be augmented. This can be written as {ve f (v)>0}↛t in G f . In this case the cut \((\bar{T}, T)\) with T={vVvt in G f } has value 0 in G f . Because all cuts are non-negative it is a minimum cut.

2.1 General Push-relabel

A Distance function d :V→{0,1,…,n} in G f assigns to vV the length of the shortest residual path from v to t, or n if no such path exists. A shortest path cannot have loops, thus its length is not greater than n−1. Let us denote d =n.

A labeling d:V→{0,1,…,d } is valid in G f if d(t)=0 and d(u)≤d(v)+1 for all (u,v)∈E such that c f (u,v)>0. Any valid labeling is a lower bound on the distance d in G f , however not every lower bound is a valid labeling.

A vertex v is called active w.r.t. (f,d) if e f (v)>0 and d(v)<d .

The definitions of reachability and validity are given w.r.t. the residual network G f , however expressions like “vw in G” or “d is valid in G” are also correct, and will be needed later in the paper. In particular, we will consider algorithms making some large steps, where a preflow increment f is computed and then applied to the initial network by assigning G:=G f . After that, the algorithm continues with G and resets f.

To ensure that residual paths do not go through the source and for reasons of efficiency, we make all edges from the source saturated during the Procedure 1 (an initialization common to all algorithms in this paper).

Procedure 1
figure 3

Init

The generic push-relabel algorithm by Goldberg and Tarjan (1988) maintains a preflow f and a valid labeling d. It starts with Init and applies the following Push and Relabel operations while possible:

  • Push(u,v), which is applicable if u is active and c f (u,v)>0 and d(u)=d(v)+1. The operation increases f(u,v) by Δ and decreases f(v,u) by Δ, where Δ=min(e f (u),c f (u,v)).

  • Relabel(u), which is applicable if u is active and ∀v∣(u,v)∈E, c f (u,v)>0 it is d(u)≤d(v). It sets d(u):=min(d ,min{d(v)+1∣(u,v)∈E, c f (u,v)>0}).

If u is active then either Push or Relabel operation is applicable to u. The algorithm preserves validity of the labeling and stops when there are no active vertices. Then for any u such that e f (u)>0, we have d(u)=d and therefore d (u)=d and ut in G f , so f is a maximum preflow.

3 Region Discharge Revisited

We now review the approach of Delong and Boykov (2008) and reformulate it for the case of a fixed graph partition. We then introduce generic sequential and parallel algorithms which will be applied with both push-relabel and augmenting path approaches.

Delong and Boykov (2008) introduce the following operation. The discharge of a region RV∖{s,t} applies Push and Relabel to vR until there are no active vertices left in R. This localizes computations to R and its boundary, defined as

$$ B^R = \bigl\{w \mid \exists u\in R\ (u,w) \in E, w\notin R,\ w\neq s,t \bigr\}. $$
(5)

When a Push is applied to an edge (v,w)∈(R,B R), the flow is sent out of the region. We say that two regions R 1,R 2V∖{s,t} interact if R 1R 2≠∅ or \(R_{1} \cap B^{R_{2}} \neq \emptyset\), that is they share vertices or they are connected by an edge. Because Push and Relabel operations work on the individual edges, discharges of non-interacting regions can be performed in parallel. The algorithm proposed by Delong and Boykov (2008) repeats the following steps until there are no active vertices in V:

  1. 1.

    Select several non-interacting regions, containing active vertices.

  2. 2.

    Discharge the selected regions in parallel, applying region-gap and region-relabel heuristics.

  3. 3.

    Apply the global gap heuristic.

All heuristics (global-gap, region-gap, region-relabel) serve to improve the distance estimate. They are very important in practice, but do not affect the theoretical properties and will be discussed in Sect. 5 devoted to the implementation.

3.1 Region Network

We now take a different perspective on the algorithm. We consider each region discharge as a proper subproblem to be solved. Given the states of the boundary edges on the input (labels and excess), region discharge of region R returns a flow and a labeling. To define it formally, we first single out a subnetwork on which region discharge will work.

A region network G R=(V R,E R,s,t,c R,e R) has the set of vertices V R=RB R∪{s,t}; set of edges E R=(R∪{s,t},R∪{s,t})∪(R,B R)∪(B R,R R); capacities c R(u,v)=c(u,v) if (u,v)∈E R∖(B R,R) and 0 otherwise; and excess e R=e| R∪{s,t} (the restriction of function e to its subdomain R∪{s,t}). This subnetwork is illustrated in Fig. 3(b). Note that the capacities of edges coming from the boundary, (B R,R), are set to zero. Indeed, these edges belong to a neighboring region network. The region discharge operation of Delong and Boykov (2008), which we refer to as Push-relabel Region Discharge (PRD), can now be defined as shown in Procedure 2.

Fig. 3
figure 4

(a) Partition of a network into 4 regions and the boundary set \(\mathcal {B}\) depicted by stars. (b) The region network corresponding to the highlighted region in (a)

Procedure 2
figure 5

PRD(G R,d)

3.2 Generic Region Discharging Algorithms

While Delong and Boykov (2008) selected the regions dynamically, trying to divide the work evenly between CPUs in each iteration and cover the most of the active vertices, we restrict ourselves to a fixed collection of regions \((R^{k})_{k=1}^{K}\) forming a partition (disjoint union) of V∖{s,t}. With respect to this partition we will use shortened notations B k, G k, V k, E k, c k, e k to denote the corresponding region boundary \(B^{R_{k}}\), region network \(G^{R_{k}}\) and its respective components. We also define the boundary \(\mathcal {B}= \bigcup_{k} B^{k}\), which is the set of all vertices incident to inter-region edges (Fig. 3(a)).

We now define generic sequential and parallel algorithms which use a black-box Discharge function as a subroutine. The sequential algorithm (Algorithm 1) takes regions one-by-one from the partition and applies the Discharge operation to them until there are no active vertices in either region. The parallel algorithm (Algorithm 2) calls Discharge for all regions concurrently and then resolves conflicts in the flow similarly to the asynchronous parallel push-relabel of Goldberg and Tarjan (1988). A conflict occurs if two interacting regions increase their labels on the vertices of a boundary edge (u,v) simultaneously and try pushing flow over it (in their respective region networks). In such a case, we accept the labels, but do not allow the flow to cross the boundary in one of the directions by the following construction. In step 5 of Algorithm 2, boundary edges, where the validity condition is potentially violated, are assigned α=0 (here, \(\mathopen{\rlap{[}\hskip1.3pt[}\mathclose{\rlap{]}\hskip1.3pt]}\) is the Iverson bracket). The flow fusion in step 6 disables the flow on such edges (the flow going “upwards”). As will be proven later, this correction restores the validity. The actual implementation does not maintain the full network G, only the separate region networks. This is in contrast to Delong and Boykov (2008), who perform all operations in the global network G.

Algorithm 1
figure 6

Sequential Discharging

Algorithm 2
figure 7

Parallel Discharging

In the case when the abstract Discharge procedure is implemented by PRD, the sequential and parallel algorithms correspond to the push-relabel approach and will be referred to as S-PRD and P-PRD respectively. S-PRD is a sequential variant of Delong and Boykov (2008) and P-PRD is a novel parallel variant. As was mentioned above, the original algorithm by Delong and Boykov (2008) allows to discharge only non-interacting regions in parallel (in this case there are no conflicts). To discharge all regions, this approach would require sequentially selecting subsets of non-interacting regions for processing.Footnote 2 Our parallel algorithm applies ideas of Goldberg and Tarjan (1988) to remove this limitation and process all regions in parallel.

We prove below that both S-PRD and P-PRD terminate with a valid labeling in at most 2n 2 sweeps. Parallel variants of push-relabel (Goldberg 1987) have the same bound on the number of sweeps. However, they perform much simpler sweeps, processing every vertex only once, compared to S/P-PRD. A natural question is whether O(n 2) bound cannot be tightened for S/P-PRD. In Sect. 9, we give an example of a graph, its partition into two regions and a valid sequence of Push and Relabel operations, implementing S/P-PRD which takes Ω(n 2) sweeps to terminate.Footnote 3 The number of inter-region edges in this example is constant, which shows that a better bound in terms of this characteristic is not possible.

3.3 Complexity of Sequential Push-relabel Region Discharging

Our proof follows the main idea of the similar result for parallel push-relabel by Goldberg (1987). The main difference is that we try to keep Discharge operation as abstract as possible. Indeed, it will be seen that proofs of termination of other variants follow the same pattern, using several important properties of the Discharge operation, abstracted from the respective algorithm. Unfortunately, to this end we do not have a unified proof, so we will analyze all cases separately.

Statement 1

(Properties of PRD)

Let (f′,d′)=PRD(G R,d), then

  1. 1.

    there are no active vertices in R w.r.t. (f′,d′) (optimality),

  2. 2.

    d′≥d; \(d'|_{B^{R}} = d|_{B^{R}}\) (labeling monotony),

  3. 3.

    d′ is valid in \(G^{R}_{f'}\) (labeling validity),

  4. 4.

    f′(u,v)>0⇒d′(u)>d(v), ∀(u,v)∈E R (flow direction).

Proof

1. Optimality. This is the stopping condition of PRD.

2, 3. Labeling validity and monotony: labels are never decreased and the Push operation preserves labeling validity (Goldberg and Tarjan 1988). Labels not in R k are not modified.

4. Flow direction: let f′(u,v)>0, then there was a push operation from u to v in some step. Let \(\tilde{d}\) be the labeling in this step. We have \(\tilde{d}(u) = \tilde{d}(v)+1\). Because labels never decrease, \(d'(u)\geq \tilde{d}(u)> \tilde{d}(v) \geq d(v)\). □

These properties are sufficient to prove correctness and the complexity bound of S-PRD. They are abstract from the actual sequence of Push and Relabel operation performed by PRD and for a given pair (f′,d′) they are easy to verify. For correctness of S-PRD we need to verify that it maintains a labeling, which is globally valid.

Statement 2

Let d be a valid labeling in G. Let f′ be a preflow in G R and d′ be a labeling satisfying properties 2 and 3 of Statement 1. Extend f′ to E by letting \(f'|_{E\backslash E^{R}}=0\) and extend d′ to V by letting \(d'|_{V\backslash V^{R}} = d|_{V\backslash V^{R}}\). Then d′ is valid in G f.

Proof

We have that d′ is valid in \(G^{R}_{f'}\). For edges outside the region network, (u,v)∈(VR,VR), it is f′(u,v)=0 and d′ coincides with d on VR. It remains to verify validity on the boundary edges (v,u)∈(B R,R) in the case \(c^{R}_{f}(v,u)=0\) and c f (v,u)>0. These are the incoming boundary edges which are zeroed in the network G R. Because \(0 = c^{R}_{f}(v,u)= c^{R}(v,u) - f(v,u) = -f(v,u)\), we have c f (v,u)=c(v,u). Since d was valid in G, d(v)≤d(u)+1. The new labeling d′ satisfies d′(u)≥d(u) and d′(v)=d(v). It follows that d′(v)=d(v)≤d(u)+1≤d′(u)+1. Hence d′ is valid in G f. □

Similar to Goldberg (1987), we introduce the potential function

$$ \varPhi = \max \bigl\{d(v)\mid v\in V,\ \mbox{$v$ is active in $G$} \bigr\}. $$
(6)

This value may increase and decrease during the algorithm run, but the total number of times it can change is bounded. We first show that for a region discharge on R its increase is bounded by the total increase of the labeling.

Statement 3

Let (f′,d′) satisfy properties 2-4 of Statement 1. Let f′ be extended to E by setting \(f'|_{E\backslash E^{R}} = 0\) and d′ be extended to V by setting \(d'|_{V\backslash V^{R}} = d|_{V\backslash V^{R}}\). Let G′=G f and Φ′ be the new potential computed for the network G′ and labeling d′. Then

$$ \varPhi' - \varPhi \leq \sum_{v\in R} \bigl[d'(v)-d(v)\bigr]. $$
(7)

Proof

Let the maximum in the definition of Φ′ be attained at a vertex v, so Φ′=d′(v). Then either vV R, in which case Φ′≤Φ (because the label and the excess of v in G and G′ are the same), or vV R and there exists a path (v 0,v 1,…,v l ), v l =v, v 0,…,v l−1R, such that f′(v i−1,v i )>0, i=1,…,l and v 0 is active in G. We have Φd(v 0), therefore

(8)

where inequality (a) is due to the flow direction property (Statement 1.4) which implies d′(v i−1)>d(v i ). The inequality (b) is due to monotony property (Statement 1.2) and due to v i RB R. The equality (c) is due to \(d'|_{B^{R}} = d|_{B^{R}}\). □

We can now state the termination.

Theorem 1

S-PRD terminates in at most 2n 2 sweeps.

Proof

Labeling d does not exceed n for every vertex. Because there are n vertices, d can be increased n 2 times at most.

From Statement 3 it follows that the increase of Φ after the discharge of region R k is bounded by the total increase of \(d|_{R^{k}}\). Since regions are disjoint, the total increase of Φ after a sweep of S-PRD is bounded by the total increase of d.

If d has not increased during a sweep (d′=d) then Φ decreases at least by 1. Indeed, let us consider the set of vertices having the label greater or equal to the label of the highest active vertex, H={vd(v)≥Φ}. These vertices do not receive flow during all discharge operations due to the flow direction property. After discharging R k, there are no active vertices in R kH (statement 1.1). Therefore, there are no active vertices in H after the full sweep.

In the worst case, Φ can increase by one n 2 times and decrease by one n 2 times. Therefore, there are no active vertices in G in at most 2n 2 sweeps and the algorithm terminates. □

When the algorithm terminates, it outputs a network G, equivalent to the initial one, a labeling d valid in G and guarantees that there are no active vertices w.r.t. d. This implies that in the current network G there are no paths form the vertices with excess to the sink and the cut \((\bar{T},T)\), with T={vvt in G} is one of the minimum cuts. The issue how to compute the reachability vt in G in a distributed fashion, utilizing d, rather than by breadth-first search in G is discussed in Sect. 5.2. This is the purpose of the last step in both of the algorithms.

3.4 Complexity of Parallel Push-relabel Region Discharging

We will now prove the validity and termination of the parallel algorithm using the results of previous section. Properties similar to Statement 1 will be proven for the fused flow and labeling (constructed at step 6 of Algorithm 2) and the bound on the increase of the potential will follow for the whole network as if it was a single region.

Statement 4

Let d be a valid labeling in the beginning of a sweep of P-PRD. Then the pair of fused flow and labeling (f′,d′) satisfies:

  1. 1.

    d′≥d (labeling monotony)

  2. 2.

    d′ is valid in G f (labeling validity)

  3. 3.

    f′(u,v)>0⇒d′(u)>d(v), ∀(u,v)∈E (flow direction).

Proof

1. We have \(d'|_{R^{k}} \geq d|_{R^{k}}\) for all k.

2. We have to prove validity for the boundary edges, where the flow and the labeling are fused from different regions. It is sufficient to study the two regions case. Denote the regions R 1 and R 2. The situation is completely symmetric w.r.t. orientation of a boundary edge (u,v). Let uR 1 and vR 2. Let only d′(v)≤d′(u)+1 be satisfied and not d′(u)≤d′(v)+1. By the construction in step 6 of Algorithm 2, flow f 2 is canceled and \(f'(u,v) = f'_{1}(u,v) \geq 0\). Suppose \(c_{f_{1}'}(u,v) >0\), then we have that \(d_{1}'(u) \leq d_{1}'(v)+1\), because \(d_{1}'\) is valid in \(G^{1}_{f'_{1}}\). It follows that \(d'(u) = d_{1}'(u) \leq d_{1}'(v)+1 = d(v)+1 \leq d_{2}'(v)+1 = d'(v)+1\), where we also used labeling monotonicity property. The inequality d′(u)≤d′(v)+1 is a contradiction, therefore it must be that c f(u,v)=0. The labeling d′ is valid on (u,v) in this case. Note that inequalities d′(v)≤d′(u)+1 and d′(u)≤d′(v)+1 cannot be violated simultaneously. In the remaining case, when both inequalities are satisfied, the labeling is valid for arbitrary flow on (u,v), so no flow is canceled in the flow fusion step.

3. If f′(u,v)>0 then \(f'_{k}(u,v)>0\) and there was a push operation from u to v in the discharge of region R ku. Let \(\tilde{d}_{k}\) be the labeling in G k on this step. We have \(d'(u) \geq \tilde{d}_{k}(u) = \tilde{d}_{k}(v)+1 \geq d(v)+1 > d(v)\). □

Theorem 2

P-PRD terminates in at most 2n 2 sweeps.

Proof

As before, the total increase of d is at most n 2. As shown above, the labeling monotony, labeling validity and flow direction are satisfied for the fused flow and labeling (f′,d′) on the region R=V∖{s,t}. Applying Statement 3, we get that the total increase of potential is bounded above by the total increase of d during a sweep.

If d has not increased during a sweep (d′=d) then α(u,v)=1 for all \((u,v)\in (\mathcal {B},\mathcal {B})\) (all boundary pairs are valid). Flow direction property implies that the flow goes only “downwards” the labeling. So no flow is canceled on the fusion step. Let H={vd(v)≥Φ}. These vertices are above any active vertices, so they cannot receive flow. After the sweep, all active vertices which were in H are discharged and must become inactive. Because there is no active vertices with label Φ or above left, it is Φ′<Φ. It follows that the algorithm will terminate in at most 2n 2 sweeps. □

4 Augmented Path Region Discharge

In this section, we introduce the core of our new algorithm, which combines path augmentation and push-relabel approaches. We will give a new definition to the distance function and validity of a labeling and introduce the new Discharge operation to be used within the generic sequential and parallel algorithms (Algorithms 1 and 2). With these modifications the algorithms will be proven correct and posses a tighter bound on the number of sweeps.

4.1 A New Distance Function

Consider the fixed partition \((R^{k})_{k=1}^{K}\). Let us introduce a distance function, which counts only inter-region edges and not inner edges. The region distance \(d^{*\mathcal {B}}(u)\) in G is the minimal number of inter-region edges contained in a path from u to t, or \(|\mathcal {B}|\) if no such path exists:

$$ d^{*\mathcal {B}}(u) = \begin{cases} \min_{P = ((u,u_1),\dots, (u_r,t)) } |P \cap (\mathcal {B},\mathcal {B})| & \mbox {if}\ u\rightarrow t, \\ |\mathcal {B}| & \mbox {if}\ u\nrightarrow t. \end{cases} $$
(9)

This distance corresponds well to the number of region discharge operations required to transfer the excess to the sink (see Fig. 4(a)).

Fig. 4
figure 8

(a) Illustration of region distance. (b) Illustration of Lemma 1: augmentation on paths from x to u or from v to y preserves XY, but not the augmentation on the red path

Statement 5

If ut then \(d^{*\mathcal {B}}(u) < |\mathcal {B}|\).

Proof

Let P be a path from u to t given as a sequence of edges. If P contains a loop then it can be removed from P and \(|P \cap (\mathcal {B},\mathcal {B})|\) will not increase. A path without loops goes through each vertex at most once. For \(\mathcal {B}\subset V\) there is at most \(|\mathcal {B}|-1\) edges in the path which have both endpoints in \(\mathcal {B}\).  □

We now let \(d^{\infty} = |\mathcal {B}|\) and redefine a valid labeling w.r.t. the new distance. A labeling d:V→{0,…,d } is valid in G if d(t)=0 and for all (u,v)∈E such that c f (u,v)>0:

(10)
(11)

Statement 6

A valid labeling d is a lower bound on the region distance \(d^{*\mathcal {B}}\).

Proof

If ut then \(d(u)\leq d^{*\mathcal {B}}\). Otherwise, let P=((u,v 1),…,(v l ,t)) be one of the shortest paths w.r.t. \(d^{*\mathcal {B}}\), i.e. \(d^{*\mathcal {B}}(u) = |P \cap (\mathcal {B},\mathcal {B})|\). Applying the validity property to each edge in this path, we have \(d(u) \leq d(t) + |P \cap (\mathcal {B},\mathcal {B})| = d^{*\mathcal {B}}(u)\). □

4.2 New Region Discharge

In this subsection, reachability relations “→”, “↛”, residual paths, and labeling validity will be understood in the region network G R or its residual \(G^{R}_{f}\).

The new Discharge operation, called Augmented path Region Discharge (ARD), works as follows. It first pushes excess to the sink along augmenting paths inside the network G R. When it is no longer possible, it continues to augment paths to vertices in the region boundary B R in the order of their increasing labels. This is represented by the sequence of nested sets T 0={t}, T 1={t}∪{vB Rd(v)<1}, …, \(T_{d^{\infty}} = \{t\}\cup \{v\in B^{R} \mid d(v)<d^{\infty}\}\). Set T i is the destination of augmentations in stage i. As we prove below, in stage i>0 residual paths may exist only to the set T i T i−1={vd(v)=i−1}.

The labels on the boundary \(d|_{B^{R}}\) remain fixed during Procedure 3 and the labels d| R inside the region do not participate in augmentations and therefore are updated only in the end.

Procedure 3
figure 9

ARD(G R,d)

Procedure 4
figure 10

Augment(X,Y)

We claim that Procedure 3 terminates with no active vertices inside the region, preserves validity and monotonicity of the labeling, and pushes flow from higher labels to lower labels w.r.t. the new labeling. These properties will be required to prove finite termination and correctness of S-ARD. Before we prove them (Statement 10) we need the following intermediate results:

  • Properties of the network \(G^{R}_{f}\) maintained by Procedure 3 (Statement 7, Corollaries 1 and 2).

  • Properties of valid labellings in the network \(G^{R}_{f}\) (Statement 8).

  • Properties of the labeling constructed by region-relabel (line 4 of Procedure 3) in the network \(G^{R}_{f}\) (Statement 9).

Lemma 1

Let X,YV R, XY=∅, XY. Then XY is preserved after (i) augmenting a path (x,…,v) with xX and vV R; (ii) augmenting a path (v,…,y) with yY and vV R.

Proof

(See Fig. 4(b).) Let \(\mathcal{X}\) be the set of vertices reachable from X. Let \(\mathcal{Y}\) be the set of vertices from which Y is reachable. Clearly \(\mathcal{X} \cap \mathcal{Y} = \emptyset\), otherwise XY. Therefore, the cut \((\mathcal{X}, \bar{\mathcal{X}})\) separates X and Y and has all edge capacities equal zero. Any residual path starting in X or ending in Y cannot cross the cut and its augmentation cannot change the edges of the cut. Hence, X and Y will stay separated. □

Statement 7

Let vV R and vT a in G f in the beginning of stage i 0 of Procedure 3, where a,i 0∈{0,1,…,d }. Then vT a holds until the end of Procedure 3, that is during all stages ii 0.

Proof

We need to show that vT a is not affected by augmentations performed by Procedure 3. If i 0a, we first prove vT a holds during stages i 0ia. Consider augmentation of a path (u 0,u 1,…,u l ), u 0R, u l T i T a , e f (u 0)>0. Assume vT a before augmentation. By Lemma 1 with X={v}, Y=T a (noting that XY and the augmenting path ends in Y), after the augmentation vT a . By induction, it holds till the end of stage a and hence in the beginning of stage a+1.

We can assume now that i 0>a. Let A={uRe f (u)>0}. At the end of stage i 0−1, we have \(A \nrightarrow T_{i_{0}-1}\supset T_{a}\) by construction. Consider augmentation in stage i 0 on a path (u 0,u 1,…,u l ), u 0R, \(u_{l} \in T_{i_{0}}\), e f (u 0)>0. By construction, u 0A. Assume {v}∪AT a before augmentation. Apply Lemma 1 with X={v}∪A, Y=T a (we have XY and u 0AX). After augmentation it is XT a . By induction, XT a till the end of stage i 0. By induction on stages, vT a until the end of the Procedure 3 procedure. □

Corollary 1

If wB R then wT d(w) throughout the Procedure  3 procedure.

Proof

At initialization, it is fulfilled by construction of G R due to c R(B R,R)=0. It holds then during Procedure 3 by Statement 7. □

In particular, we have B Rt during Procedure 3.

Corollary 2

Let (u,v 1v l ,w) be a residual path in \(G^{R}_{f}\) from uR to wB R and let v r B R for some r. Then d(v r )≤d(w).

Proof

We have \(v_{r} \nrightarrow T_{d(v_{r})}\). Suppose d(w)<d(v r ), then \(w\in T_{d(v_{r})}\) and because v r w it is \(v_{r} \rightarrow T_{d(v_{r})}\) which is a contradiction. □

The properties of the network \(G^{R}_{f}\) established by Statement 7 and Corollary 2 are illustrated in Fig. 5.

Fig. 5
figure 11

(a) Reachability relations in the network \(G^{R}_{f}\) at the end of stage 1 of ARD: {ve f (v)>0}↛T 1; \(T_{d^{\infty}}\backslash T_{1} \nrightarrow R\). (b) Example of a path in the network \(G^{R}_{f}\) for which by Corollary 2 it must be d(v)≤d(w). Note, such a path is not possible at the beginning of ARD, but in the middle it may exist since residual capacities of edges (B R,R) may become non-zero

Statement 8

Let d be a valid labeling, d(u)≥1, uR. Then uT d(u)−1.

Proof

Suppose uT 0. Then there exist a residual path (u,v 1,…,v l ,t), v i R (by Corollary 1 it cannot happen that v i B R). By validity of d we have d(u)≤d(v 1)≤…≤d(v l )≤d(t)=0, which is a contradiction.

Suppose d(u)>1 and uT d(u)−1. Because uT 0, it must be that uw, wB R and d(w)<d(u)−1. Let (v 0,…,v l ) be a residual path with v 0=u and v l =w. Let r be the minimal number such that v r B R. By validity of d we have d(u)≤d(v 1)≤…≤d(v r−1)≤d(v r )+1. By Corollary 2 we have d(v r )≤d(w), hence d(u)≤d(w)+1 which is a contradiction. □

Statement 9

For d computed on line 4 of Procedure 3 and any uR it holds:

  1. 1.

    d is valid;

  2. 2.

    uT a d(u)≥a+1.

Proof

1. Let (u,v)∈E R and c(u,v)>0. Clearly uv. Consider four cases:

  • case uR, vB R: Then uT d(v)+1, hence d(u)≤d(v)+1.

  • case uR, vR: If \(v\nrightarrow T_{d^{\infty}}\) then d(v)=d and d(u)≤d(v). If \(v\rightarrow T_{d^{\infty}}\), then d(v)=min{ivT i }. Let i=d(v), then vT i and uT i , therefore d(u)≤i=d(v).

  • case uB R, vR: By Corollary 1, uT d(u). Because uv, it is vT d(u), therefore d(v)≥d(u)+1 and d(u)≤d(v)−1≤d(v)+1.

  • case when u=t or v=t is trivial.

2. The “⇐” direction follows by Statement 8 applied to d, which is a valid labeling. The “⇒” direction: we have uT a and d(u)≥min{iuT i }=min{i>auT i }≥a+1. □

Statement 10

(Properties of ARD)

Let d be a valid labeling in G R. The output (f′,d′) of Procedure 3 satisfies:

  1. 1.

    There are no active vertices in R w.r.t. (f′,d′) (optimality),

  2. 2.

    d′≥d, \(d'|_{B^{R}} = d|_{B^{R}}\) (labeling monotonicity),

  3. 3.

    d′ is valid in \(G^{R}_{f'}\) (labeling validity),

  4. 4.

    f′ is a sum of path flows, where each path is from a vertex uR to a vertex v∈{t}∪B R and it is d′(u)>d(v) if vB R (flow direction).

Proof

  1. 1.

    In the last stage, the Procedure 3 procedure augments all paths to \(T_{d^{\infty}}\). After this augmentation a vertex uR either has excess 0 or there is no residual path to \(T_{d^{\infty}}\) and hence d′(u)=d by construction.

  2. 2.

    For d(u)=0, we trivially have d′(u)≥d(u). Let d(u)=a+1>0. By Statement 8, uT a in G R and it holds also in \(G^{R}_{f'}\) by Statement 7. From Statement 9.2, we conclude that d′(u)≥a+1=d(u). The equality \(d'|_{B^{R}}=d|_{B^{R}}\) is by construction.

  3. 3.

    Proven by Statement 9.1.

  4. 4.

    Consider a path from u to vB R, augmented in stage i>0. It follows that i=d(v)+1. At the beginning of stage i, it is uT i−1. By Statement 7, this is preserved till the end of Procedure 3. By Statement 9.2, d′(u)≥i=d(v)+1>d(v).

 □

Algorithms 1 and 2 for Discharge being Procedure 3 will be referred to as S-ARD and P-ARD, respectively.

4.3 Complexity of Sequential Augmented path Region Discharging

Statement 2 holds for S-ARD as well, so S-ARD maintains a valid labeling.

Theorem 3

S-ARD terminates in at most \(2|\mathcal {B}|^{2}+1\) sweeps.

Proof

The value of d(v) does not exceed \(|\mathcal {B}|\) and d is non-decreasing. The total increase of \(d|_{\mathcal {B}}\) during the algorithm is at most \(|\mathcal {B}|^{2}\).

After the first sweep, active vertices are only in \(\mathcal {B}\). Indeed, discharging region R k makes all vertices vR k inactive and only vertices in \(\mathcal {B}\) may become active. So by the end of the sweep, all vertices \(V\backslash \mathcal {B}\) are inactive.

Therefore, after the first sweep, the potential can be equivalently written as

(12)

We will prove the following two cases for each sweep but the first one:

  1. 1.

    If \(d|_{\mathcal {B}}\) is increased then the increase in Φ is no more than total increase in \(d|_{\mathcal {B}}\). Consider discharge of R k. Let Φ be the value before ARD on R k and Φ′ the value after. Let Φ′=d′(v). It must be that v is active in G′. If vV k, then d(v)=d′(v) and e(v)=e f(v) so Φd(v)=Φ′.

    Let vV k. After the discharge, vertices in R k are inactive, so vB k and it is d′(v)=d(v). If v was active in G then Φd(v) and we have Φ′−Φd′(v)−d(v)=0. If v was not active in G, there must exist an augmenting path from a vertex v 0 to v such that \(v_{0}\in R^{k}\cap \mathcal {B}\) was active in G. For this path, the flow direction property implies d′(v 0)≥d(v). We now have \(\varPhi'-\varPhi \leq d'(v)-d(v_{0}) = d(v)-d(v_{0}) \leq d'(v_{0})-d(v_{0}) \leq \sum_{v\in R^{k} \cap \mathcal {B}}[d'(v)-d(v)]\). Summing over all regions, we get the result.

  2. 2.

    If \(d|_{\mathcal {B}}\) is not increased then Φ is decreased at least by 1. We have d′=d. Let us consider the set of vertices having the highest active label or above, H={vd(v)≥Φ}. These vertices do not receive flow during all discharge operations due to the flow direction property. After the discharge of R k there are no active vertices left in R kH (Statement 10.1). After the full sweep, there are no active vertices in H.

In the worst case, starting from sweep 2, Φ can increase by one \(|\mathcal {B}|^{2}\) times and decrease by one \(|\mathcal {B}^{2}|\) times. There are no active vertices left in at most \(2 |\mathcal {B}|^{2}+1\) sweeps. □

On termination, we have that the labeling is valid and there are no active vertices in G. Therefore, the accumulated preflow is maximal and a minimum cut can be found by analyzing the reachability in G (see discussion for S-PRD Sect. 3.3 and implementation details Sect. 5.2).

4.4 Complexity of Parallel Augmented Path Region Discharging

Statement 11

(Properties of Parallel ARD)

Let d be a valid labeling at the beginning of a sweep of P-ARD. Then the pair of fused flow and labeling (f′,d′) satisfies:

  1. 1.

    Vertices in \(V\backslash \mathcal {B}\) are not active in G f (optimality),

  2. 2.

    d′≥d (labeling monotony),

  3. 3.

    d′ is valid (labeling validity),

  4. 4.

    f′ is the sum of path flows, where each path is from a vertex uV to a vertex \(v\in \mathcal {B}\), satisfying d′(u)≥d(v) (weak flow direction).

Proof

  1. 1.

    For each k there are no active vertices in R k w.r.t. \((f'_{k}, d'_{k})\). The fused flow f′ may differ from \(f'_{k}\) only on the boundary edges \((u,v)\in (\mathcal {B},\mathcal {B})\). So there are no active vertices in \(V\backslash \mathcal {B}\) w.r.t. (f′,d′).

  2. 2.

    By construction.

  3. 3.

    Same as in P-PRD.

  4. 4.

    Consider the augmentation of a path from uR k to vB k during ARD on G k and canceling of the flow on the last edge of the path during the flow fusion step. Let the last edge of the path be (w,v). We need to prove that d′(u)≥d(w). Let \(\tilde{d}\) be the labeling in G k right before augmentation, as if it was computed by region-relabel. Because \(\tilde{d}\) is valid it must be that \(\tilde{d}(w)\leq \tilde{d}(v)+1\). We have \(d'_{k}(u) > d(v) = \tilde{d}(v) \geq \tilde{d}(w)-1 \geq d(w)-1\). So d′(u)≥d(w).

 □

Theorem 4

P-ARD terminates in \(2|\mathcal {B}|^{2}+1\) sweeps.

Proof

As before, total increase of \(d|_{\mathcal {B}}\) is at most \(|\mathcal {B}|^{2}\).

After the first sweep, active vertices are only in \(\mathcal {B}\) by Statement 11.1.

For each sweep after the first one:

  • If \(d|_{\mathcal {B}}\) is increased then increase in Φ is no more than the total increase of \(d|_{\mathcal {B}}\). Let Φ′ be the value of the potential in the network G′=G f. Let Φ′=d′(v). It must be that v is active in G′ and \(v\in \mathcal {B}\).

    If v was active in G then Φd(v) and we have Φ′−Φd′(v)−d(v).

    If v was not active in G then there must exist a path flow in f′ from a vertex v 0 to v such that \(v_{0}\in \mathcal {B}\) was active in G. For this path, the weak flow direction property implies d′(v 0)≥d(v). We have \(\varPhi'-\varPhi \leq d'(v)-d(v_{0}) = d'(v)-d(v)+d(v)-d(v_{0}) \leq d'(v)-d(v)+d'(v_{0})-d(v_{0}) \leq \sum_{v\in \mathcal {B}}[d'(v)-d(v)]\).

  • If \(d|_{\mathcal {B}}\) is not increased then Φ is decreased at least by 1. In this case, f′ satisfies the strong flow direction property and the proof of Theorem 3 applies.

After total of \(2 |\mathcal {B}|^{2}+1\) sweeps, there are no active vertices left. □

5 Implementation

In this section, we first discuss heuristics for improving the distance labeling (making it closer to the true distance at a cheap cost) commonly used in the push-relabel framework. They are essential for the practical performance of the algorithms. We then describe our base implementations of S-ARD/S-PRD and the solvers they rely on. In the next section, we describe an efficient implementation of ARD, which is more sophisticated but has a much better practical performance. All of the labeling heuristics can only increase the labels and preserve validity of the labeling. Therefore, they do not break theoretical properties of the respective algorithms.

5.1 Heuristics

Region-relabel heuristic computes labels d| R of the region vertices, given the distance estimate on the boundary, \(d|_{B^{R}}\). There is a slight difference between PRD and ARD variants (using distance d and \(d^{*\mathcal {B}}\), resp.), displayed by the corresponding “if” conditions (Procedure 5).

Procedure 5
figure 12

Region-relabel(G R,\(d|_{B^{R}}\))

In the implementation, the set of boundary vertices is sorted in advance, so that Region-relabel runs in O(|E R|+|V R|+|B R|log|B R|) time and uses O(|V R|) space. The resulting labeling d′ is valid and satisfies d′≥d for arbitrary valid d.

Global Gap Heuristic

Let us briefly explain the global gap heuristic (Cherkassky and Goldberg 1994). It is a sufficient condition to identify that the sink is unreachable from a set of vertices. Let there be no vertices with label g>0: ∀vV d(v)≠g, and let d(u)>g. For a valid labeling d, it follows that there is no vertex v for which c(u,v)>0 and d(v)<g. Assuming there is, we will have d(u)≤d(v)+1≤g, which is a contradiction. Therefore the sink is unreachable from all vertices {ud(u)>g} and their labels may be set to d .

Region gap heuristic of Delong and Boykov (2008) detects if there are no vertices inside region R having label g>0. Such vertices can be connected to the sink in the whole network only through one of the boundary vertices, so they may be relabeled up to the closest boundary label. Here is the algorithm (Procedure 6)Footnote 4

Procedure 6
figure 13

d| R = Region-gap(G R,\(d|_{R\cup B^{R}},g\))

If no boundary vertex is above the gap, then d next=d in step 1 and all vertices above the gap are disconnected from the sink in the network G. Interestingly, this sufficient condition does not imply a global gap. In our implementation of PRD, we detect the region-gap efficiently after every vertex relabel operation by discovering an empty bucket (see the implementation of S/P-PRD in Sect. 5.5).

5.2 Reachability/Exact Distance Computation

At the termination of our algorithms (S/P-PRD, S/P-ARD), we have found a maximum preflow and we know that \((\bar{T},T)\), defined by T={vvt in G}, is a minimum cut. However, we only know the lower bound d on the true distance d (resp. \(d^{*\mathcal {B}}\)) and therefore the reachability relation vt is not fully known at this point. When d(v)=d we are sure that vt in G and hence v must be in the source set of a minimum cut, but if d(v)<d it is still possible that vt in G. Therefore, we need to do some extra work to make d the exact distance and in this way to find the minimum cut. For this purpose we execute several extra sweeps, performing only region-relabel and gap heuristics until labels stop changing. We claim that at most d such extra sweeps are needed. We give a proof for the case of push-relabel distance.

Proof

Let us call labels d(v) loose if d(v)<d (v) and exact if d(v)=d (v). Consider the lowest loose label, L=min{d(v)∣d(v)<d (v)} and the set of loose vertices having this label, \(\mathcal {L}= \{v \mid L=d(v)<d^{*}(v) \}\). Let us show that after a sweep of region-relabel, the value of L increases at least by 1. Let \(v\in \mathcal {L}\), (v,w)∈E and c(v,w)>0. If d(w) is loose, we have d(w)≥L by construction. Assume that d(w) is exact. Since d(v)<d (v) and d (v)≤d (w)+1, we have d(w)≥d(v)=L. Therefore, all neighbors of v have label L or above. After the elementary Relabel of v or Region-relabel of the region including v, its label will increase at least by 1 (recall that Relabel of v performs d(v):=min w {d(w)∣(v,w)∈E, c(v,w)>0}+1). Because this holds for all vertices from \(\mathcal {L}\), the value L will increase at least by 1 after elementary Relabel of all vertices or a sweep of Region-relabel. Because L is bounded above by d , after at most d sweeps, d will be the exact distance. □

This proof can be suitably modified for the case of region distance (used in ARD) by replacing the pair (v,w) with a path from v to a boundary vertex w. In this case, we have the bound \(d^{\infty} = |\mathcal {B}|\) sweeps. In the experiments, we observed that in order to compute the exact distance, only few extra sweeps were necessary (from 0 to 2) for S/P-ARD and somewhat more for S/P-PRD. Note, to compute the final reachability relation in S/P-PRD, the region distance and ARD Region-relabel could be employed. However, we did not implement this improvement. In Sect. 6 we describe how ARD Region-relabel is replaced by a dynamic data structure (search trees), allowing for quick recomputation during the sweeps.

5.3 Referenced Implementations

Boykov-Kolmogorov (BK)

The reference augmenting path implementation by Boykov and Kolmogorov (2004) (v3.0, http://www.cs.adastral.ucl.ac.uk/~vnk/software.html). We will also use the possibility of dynamic updates in this code due to Kohli and Torr (2005). There is only a trivial O(mn 2|C|) complexity bound known for this algorithm,Footnote 5 where C is the cost of a minimum cut.

Highest level Push-Relabel (HIPR)

The reference push-relabel implementation by Cherkassky and Goldberg (1994) (v3.6, http://www.avglab.com/andrew/soft.html). This implementation has two stages: finding the maximum preflow/minimum cut and upgrading the maximum preflow to a maximum flow. Only the first stage was executed and benchmarked. We tested two variants with frequency of the global relabel heuristic (the frequency parameter roughly corresponds to the proportion of time spent on global updates versus push/relabel) equal to 0.5 (the default value in HIPR v3.6) and equal to 0. These variants will be denoted HIPR0.5 and HIPR0 respectively. HIPR0 executes only one global update at the beginning. Global updates are essential for difficult problems. However, HIPR0 was always faster than HIPR0.5 in our experiments with real test instances.Footnote 6 The worst case complexity is \(O(n^{2}\sqrt{m})\).

5.4 S/P-ARD Implementation

The basic implementation of ARD simply invokes BK solver as follows. On stage 0 we compute the maximum flow in the network G R by BK, augmenting paths from source to the sink. On the stage i, infinite capacities are added from the boundary vertices having label i−1 to the sink, using the possibility of dynamic changes in BK. The flow augmentation to the sink is then continued, reusing the search trees. The Region-relabel procedure is implemented as described earlier in this section. In the beginning of next discharge, we clear the infinite link from boundary to the sink and repeat the above. Some parts of the sink search tree, linked through the boundary vertices, get destroyed, but the larger part of it and the source search tree are reused. A more efficient implementation is described in Sect. 6. It includes additional heuristics and maintenance of separate boundary search trees.

S-ARD

In the streaming mode, we keep only one region in the memory at a time. After a region is processed by ARD, all the internal data structures have to be saved to the disk and cleared from memory until the region is discharged next time. We manage this by allocating all the region’s data into a fixed page in the memory, which can be saved and loaded preserving the pointers. By doing the load/unload manually (rather than relying on the system swapping mechanism), we can accurately measure the pure time needed for computation (CPU) and the amount of disk I/O. We also can use 32 bit pointers with larger problems.

A region with no active vertices is skipped. The global gap heuristic is executed after each region discharge. Because it is based on labels of boundary vertices only, it is sufficient to maintain a label histogram with \(|\mathcal {B}|\) bins to implement it. S-ARD uses \(O(|\mathcal {B}| + |(\mathcal {B},\mathcal {B})|)\) “shared” memory and O(|V R+E R|) “region” memory, to which regions are loaded one at a time.

To solve large problems, which do not fit in the memory, we have to create region graphs without ever loading the full problem. We implemented a tool called splitter, which reads the problem from a file and writes edges corresponding to the same region to the region’s separate “part” file. Only the boundary edges (linking different regions) are withheld to the memory.

P-ARD

We implemented this algorithm for a shared-memory system using OpenMP language extension. All regions are kept in the memory, the discharges are executed concurrently in separate threads, while the gap heuristic and messages exchange are executed synchronously by the master thread.

5.5 S/P-PRD Implementation

To solve region discharge subproblems in PRD in the highest label first fashion, we designed a special reimplementation of HIPR, which will be denoted HPR. We intended to use the original HIPR implementation to make sure that PRD relies on the state-of-the art core solver. It was not possible directly. A subproblem in PRD is given by a region network with fixed distance labels on the boundary (let us call them seeds). Distance labels in PRD may go up to n in the worst case. The same applies to region subproblems as well. Therefore, keeping an array of buckets corresponding to possible labels (like in HIPR), would not be efficient. It would require O(|V|) memory and an increased complexity. However, because a region has only |V R| vertices, there are no more than |V R| distinct labels at any time. This allows to keep buckets as a doubly-linked list with at most |V R| entries. Highest label selection rule and the region-gap heuristic can then be implemented efficiently with just a small overhead. We tried to keep other details similar to HIPR (current arc data structure, etc.). HPR with arbitrary seeds has the worst case complexity \(O(|V^{R}|^{2} \sqrt{|E^{R}|})\) and uses O(|V R|+|V E|) space. When the whole problem is taken as a single region, HPR should be equivalent to HIPR0. Though the running time on the real instances can be somewhat different.

S-PRD

This is our reimplementation of the algorithm by Delong and Boykov (2008) for an arbitrary graph and a fixed partition, using HPR as a core solver. It uses the same memory model, paging mechanism and the splitter tool as S-ARD. The region discharge is always warm-started. We found it inefficient to run the region-relabel after every discharge. In the current experiments, motivated by performance of HIPR0, we run it once at the beginning and then only when a global gap is discovered. To detect a global gap, we keep a histogram of all labels, O(n) memory, and update it after each region discharge (in O(|V R|) time). In practice, this O(n) memory is not a serious limitation—labels are usually well below n. If they are not then we should consider a weaker gap heuristic with a smaller number of bins. Applying the gap (raising the corresponding vertices to d ) for all regions is delayed until they are loaded. So we keep the track of the best global gap detected for every region. Similar to how the sequential Algorithm 1 represents both S-ARD and S-PRD, it constitutes a piece of generic code in our implementation, where the respective discharge procedure and gap heuristics are plugged.

P-PRD

This is our implementation of parallel PRD for shared-memory system with OpenMP.

6 Efficient ARD Implementation

The basic implementation of S-ARD, as described in the previous section, worked reasonably fast (comparable to BK) on simple problems like 2D stereo and 2D random segmentation (Sect. 7.1). However, on some 3D problems the performance was unexpectedly bad. For example, to solve LB07-bunny-lrg instance (Sect. 7.2) the basic implementation required 32 minutes of CPU time. In this section we describe an efficient implementation which is more robust and is comparable in speed with BK on all tested instances. In particular, to solve LB07-bunny-lrg it takes only 15 seconds of CPU time. The problem why the basic implementation is so slow is in the nature of the algorithm: sometimes it has to augment the flow to the boundary, without knowing of whether it is a useful work or not. If the particular boundary was selected wrongly, the work is wasted. This happens in LB07-bunny-lrg instance, where the data seeds are sparse. A huge work is performed to push the flow around in the first few iterations, before a reasonable labeling is established. We introduce two heuristics how to overcome this problem: the boundary-relabel heuristic and partial discharges. An additional speed-up is obtained by dynamically maintaining boundary search trees and the current labeling.

6.1 Boundary Relabel Heuristic

We would like to have a better distance estimate, but we cannot run a global relabel because implementing it in a distributed fashion would take several full sweeps, which would be too wasteful. Instead, we go for the following cheaper lower bound. Our implementation keeps all the boundary edges (including their flow and distance labels of the adjacent vertices) in the shared memory. Figure 6(a) illustrates this boundary information. We want to improve the labels by analyzing only this boundary part of the graph, not looking inside the regions. Since we do not know how the vertices are connected inside the regions, we have to assume that every boundary vertex might be connected to any other one within the region, except of the following case. If u and v are in the same region R and d(u)>d(v) then we know for sure that uv in G R. It follows from the validity of labeling d (as defined for ARD in Sect. 4). We can calculate now a lower bound on the distance \(d^{*\mathcal {B}}\) in G assuming that all the rest of the vertices are potentially connected within the regions.

Fig. 6
figure 14

Boundary relabel heuristic: (a) Boundary vertices of the network and a valid labeling. Directed arcs correspond to non-zero residual capacities. Vertices without numbers have label d and do not participate in the construction. (b) Vertices having the same label are grouped together within each region and arcs of zero length (of red color) are added from a group to the next label’s group. It is guaranteed that e.g., vertices with label 1 are not reachable from vertices with label 2 within the region, hence there is no arc 2→1. Black arcs have the unit length. (c) The distance in the auxiliary graph is a valid labeling and a lower bound on the distance in the original network

We will now construct an auxiliary directed graph \(\bar{G}\) with arcs having length 0 or 1 and show that the distance in this graph (according to the arc lengths) lower bounds \(d^{*\mathcal {B}}\). If d(v)=d(u) we have to assume that vu and uv in G R, therefore the new lower bound for u and v will coincide. Hence we group vertices having the same label within a region together as shown in Fig. 6(b). In the case d(v)<d(u), we know that uv but have to assume vu in R. We thus add a directed arc of length zero from the group of v to the group of u (Fig. 6(b)). Let d 1<d 2<d 3 be labels of groups within one region. There is no need to create an arc from d 1 to d 3, because two arcs from d 1 to d 2 and from d 2 to d 3 of length zero are an equivalent representation. Therefore it is sufficient to connect only groups having consecutive labels. We then add all residual edges (u,v) between the regions to \(\bar{G}\) with length 1. We can calculate the distance to vertices with label 0 in \(\bar{G}\) by running Dijkstra’s algorithm. Let this distance be denoted d′. We then update the labels as

$$ d(u) := \max\bigl\{d(u), d'(u)\bigr\}. $$
(13)

We have to prove the following two points:

  1. 1.

    d′ is a valid labeling;

  2. 2.

    If d and d′ are valid labellings, then max(d,d′) is valid.

Proof

1. Let c(u,v)>0. Let u and v be in the same region. It must be that d(u)≤d(v). Therefore either u and v are in the same group or there is an arc of length zero from group of u to group of v. It must be d′(u)≤d′(v) in any case. If u and v are in different regions, there is an arc of length 1 from group of u to group of v and therefore d′(u)≤d′(v)+1.

2. Let l(u,v)=1 if \((u,v)\in (\mathcal {B},\mathcal {B})\) and l(u,v)=0 otherwise. We have to prove that if c(u,v)>0 then

$$ \max\bigl\{d(u),d'(u)\bigr\} \leq \max \bigl\{d(v),d'(v)\bigr\} + l(u,v). $$
(14)

Let max{d(u),d′(u)}=d(u). From validity of d we have d(u)≤d(v)+l(u,v). If d(v)≥d′(v), then max{d(v),d′(v)}=d(v) and (14) holds. If d(v)<d′(v) then d(u)≤d(v)+l(u,v)<d′(v)+l(u,v) and (14) holds again. □

The complexity of boundary relabel heuristic is \(O(|(\mathcal {B},\mathcal {B})|)\). It is relatively inexpensive and can be run after each sweep. It does not affect the correctness and the worst case bound on the number of sweeps of S/P-ARD.

6.2 Partial Discharge

Another heuristic which proved very efficient was simply to postpone path augmentations to higher boundary vertices to further sweeps. This allows to save a lot of unnecessary work, especially when used in combination with boundary-relabel. More precisely, on sweep s the ARD procedure is allowed to execute only stages up to s. This way, in sweep 0 only paths to the sink are augmented and not any path to the boundary. Vertices which cannot reach the sink (but can potentially reach the boundary) get label 1. These initial labels may already be improved by boundary-relabel. In sweep 1 paths to the boundary with label 0 are allowed to be augmented and so on.

Note that this heuristic does not affect the worst case complexity of S/P-ARD. Because labels can grow only up to \(|\mathcal {B}|\), after at most \(|\mathcal {B}|\) sweeps the heuristic turns into full discharge. Therefore, the worst case bound of O(\(|\mathcal {B}^{2}|\)) sweeps remains valid. In practice, we found it to increase the number of sweeps slightly, while significantly reducing the total computation time. Similarly, in the case of push-relabel, it would make sense to perform several sweeps of Region-relabel before doing any pushes to get a better estimate of the distance.

6.3 Boundary Search Trees

We now redesign the implementation of ARD such that not only the sink and source search trees are maintained but also the search trees of boundary vertices. This allows to save computation when the labeling of many boundary vertices remains constant during the consequent sweeps, with only a small fraction changing. Additionally, knowing the search tree for each inner vertex of the region determines its actual label, so the region-relabel procedure becomes obsolete. The design of the search tree data structures, their updates and other detail are the same as proposed by Kolmogorov (2004), only few changes to the implementation are necessary. For each vertex vR, we introduce a mark \(\tilde{d}(v)\) which corresponds to the root label of its tree or is set to a special free mark if v is not in any tree. For each tree we keep a list of open vertices (called active by Kolmogorov (2004)). A vertex is open if it is not blocked by the vertices of the trees with the same or lower root label (more precisely, v is open if it is not free and there is a residual edge (u,v) such that u is free or its root label is higher than that of v). The trees may grow only at the open vertices.

Figure 7 shows the correspondence between search trees and the labels. The sink search tree is assigned label −1. In the stage 0 of ARD, we grow the sink tree and augment all found paths if the sink tree touches the source search tree. Vertices, which are added to the sink tree are marked with label \(\tilde{d} = -1\). In stage i+1 of ARD, we grow trees with root at a boundary vertices w with label d(w)=i, all vertices added to the tree are marked with \(\tilde{d} = i\). When the tree touches the source search tree, the found path is augmented. If the tree touches a vertex u with label \(\tilde{d}(u) < i\), it means that u is already in the search tree with a lower root and no action is taken. It cannot happen that a vertex is reached with label \(\tilde{d}>i\) during growth of a search tree with root label i, this would contradict to the properties of ARD. The actual label of a vertex v at any time is determined as \(\tilde{d}(v)+1\) if vR and d(v) if vB R.

Fig. 7
figure 15

Search trees. (a) A region with some residual arcs. The region has only 3 boundary vertices, for simplicity, the numbers correspond to the labels. (b) Search trees of the sink and boundary vertices: when a vertex can be reached by several trees, it choses the one with the lowest label of the root. The sink is assigned a special label −1. The source search tree is empty in this example. (c) Labels of the inner vertices are determined as their tree’s root label +1

Let us now consider the situation in which region R has built some search trees and the label of a boundary vertex w is risen from d(w) to d′(w) (as a result of update from the neighboring region or one of the heuristics). All the vertices in the search tree starting from w were previously marked with d(w) and have to be declared as free vertices or adopted to any other valid tree with root label d(w). The adaptation is performed by the same mechanism as in BK. The situation when a preflow is injected from the neighboring region and (a part of) a search tree becomes disconnected is also handled by the orphan adaptation mechanism.

The combination of the above improvements allows S-ARD to run in about the same time as BK on all tested vision instance (Sect. 7.2), sometimes being even significantly faster (154 s vs. 245 s on BL06-gargoyle-lrg).

7 Experiments

All experiments were conducted on a system with Intel Core 2 Quad CPU@2.66 Hz, 4 GB memory, Windows XP 32 bit and Microsoft VC compiler. Our implementation and instructions needed to reproduce the experiments can be found at http://cmp.felk.cvut.cz/~shekhovt/d_maxflow. We conducted 3 series of experiments:

  • Synthetic experiments, where we observe general dependencies of the algorithms, with some statistical significance, i.e. not being biased to a particular problem instance. It also serves as an empirical validation, as thousands of instances are solved. Here, the basic implementation of S-ARD was used.

  • Sequential competition. We study sequential versions of the algorithms, running them on real vision instances. Only a single core of the CPU is utilized. We fix the region partition and study how much disk I/O it would take to solve each problem when only one region can be loaded in the memory at a time. In this and the next experiment we used the efficient implementation of ARD. Note, in the preceding publication (Shekhovtsov and Hlavac 2011) we reported worse results with the earlier implementation.

  • Parallel competition. Parallel algorithms are tested on the instances which can fully fit in 2 GB of memory. All 4 cores of the CPU are allowed to be used. We compare our algorithms with two other state-of-the-art distributed implementations.

7.1 General Dependences: Synthetic Problems

We generated simple synthetic problems to validate the algorithms. The network is constructed as a 2D grid with a regular connectivity structure. Figure 8(a) shows an example of such a network. The edges are added to the vertices at the following relative displacements (0,1), (1,0), (1,2), (2,1), (1,3), (3,1), (2,3), (3,2), (0,2), (2,0), (2,2), (3,3), (3,4), (4,2). By connectivity we mean the number of edges incident to a vertex far enough from the boundary. Adding pairs (0,1), (1,0) results in connectivity 4 and so on. Each vertex is given an integer excess/deficit distributed uniformly in the interval [−500 500]. A positive number means a source link and a negative number a sink link. All edges in the graph are assigned a constant capacity, called strength. The network is partitioned into regions by slicing it in s equal parts in both dimensions. Thus we have 4 parameters: the number of vertices, the connectivity, the strength and the number of regions. We generate 100 instances for each value of the parameters.

Fig. 8
figure 16

(a) Example of a synthetic problem: a network of the size 6×6, connectivity 8, partitioned into 4 regions. The source and sink are not shown. (b) Dependence on the interaction strength, for size 1000×1000, connectivity 8 and 4 regions. Plots show mean values and intervals containing 70 % of the samples

Let us first look at the dependence on the strength shown in Fig. 8(b). Problems with small strength are easy, because they are very local—long augmentation paths do not occur. On the other hand, long paths needs to be augmented for problems with large strength. However, finding them is easy because bottlenecks are unlikely. Therefore BK and S-ARD have a maximum in the computation time somewhere in the middle. It is more difficult to transfer the flow over long distances for push-relabel algorithms. This is where the global relabel heuristic becomes efficient and HIPR0.5 outperforms HIPR0. The region-relabel heuristic of S-PRD allows it to outperform other push-relabel variants.

In general, we think all such random 2D networks are too easy. Nevertheless, they are useful and instructive to show basic dependences. We now select the “difficult” point for BK with the strength 150 and study other dependencies:

  • The number of regions (Fig. 9). For this problem family, both the number of sweeps and the computation time grows slowly with the number of regions.

    Fig. 9
    figure 17

    Dependence on the number of regions, for size 1000×1000, connectivity 8, strength 150

  • The problem size (Fig. 10). Computation efforts of all algorithms grow proportionally. However, the number of sweeps shows different asymptotes. It is almost constant for S-ARD but grows significantly for S-PRD.

    Fig. 10
    figure 18

    Dependence on the problem size, for connectivity 8, strength 150, 4 regions

  • Connectivity (Fig. 11). Connectivity is not independent of the strength. Roughly, 4 edges with capacity 100 can transmit as much flow as 8 edges with capacity 50. Therefore while increasing the connectivity we also decrease the strength as 150⋅8 divided by connectivity in this plot.

    Fig. 11
    figure 19

    Dependence on the connectivity, for size 1000×1000, strength=(150×8)/connectivity, 4 regions

  • Workload (Fig. 12). This plot shows how much time each of the algorithms spends performing different parts of computation. Note that the problems are solved on a single computer with all regions kept in memory, therefor the time on sending messages should be understood as updates of dynamic data structure of the region w.r.t. the new labeling and flow on the boundary. For S-PRD more sweeps are needed, so the total time spent in messages and gap heuristic is increased. Additionally, the gap heuristic has to take into account all vertices, unlike only the boundary vertices in S-ARD.

    Fig. 12
    figure 20

    Workload distribution, for size 1000×1000, connectivity 8, 4 regions, strength 150. msg—passing the messages (updating flow and labels on the boundary), discharge—work done by the core solver (BK for S-ARD and HPR for S-PRD), relabel—the region-relabel operation, gap—the global gap heuristic

7.2 Sequential Competition

We tested our algorithms on the maxflow problem instances in computer vision University of Western Ontario web pages (2008). The data consist of typical max-flow problems in computer vision, graphics, and biomedical image analysis. Stereo instances are sequences of subproblems (arising in the expansion move algorithm) for which the total time should be reported. There are two models: BVZ (Boykov et al. 1998), in which the graph is a 4-connected 2D grid, and KZ2 (Kolmogorov and Zabih 2001), in which there are additional long-range links. Multiview 3D reconstruction models LB06 (Lempitsky et al. 2006) and BL06 (Boykov and Lempitsky 2006). Graphs of these problems are cellular complexes subdividing the space into 3D cubes and each cube into 24 smaller cells. Surface fitting instances LB07 (Lempitsky and Boykov 2007) are 6-connected 3D grid graphs. And finally, there is a collection of volumetric segmentation instances BJ01 (Boykov and Jolly 2001), BF06 (Boykov and Funka-Lea 2006), BK03 (Boykov and Kolmogorov 2003) with 6-connected and 26-connected 3D grid graphs.

To test our streaming algorithms, we used the regulargrid hint available in the definition of the problems to select the regions by slicing the problem into 4 parts in each dimension—into 16 regions for 2D BVZ grids and into 64 regions for 3D segmentation instances. Problems KZ2 do not have such a hint (they are not regular grids), so we sliced them into 16 pieces just by the vertex number. The same we did for the multiview LB06 instances. Though they have a size hint, we failed to interpret the vertex layout correctly (the separator set, \(\mathcal {B}\), was unexpectedly large when trying to slice along the dimensions). So we sliced them purely by the vertex number.

One of the problems we faced is pairing the arcs which are reverse of each other. While in stereo, surface and multiview problems, the reverse arcs are consequent in the files, and can be easily paired, in 3D segmentation they are not. For a generic algorithm, not being aware of the problem’s regularity structure, it is actually a non-trivial problem requiring at least the memory to read all of the arcs first. Because our goal is a relative comparison, we did not pair the arcs in 3D segmentation. This means we kept twice as many arcs than necessary for those problems. This is seen in Table 1, e.g. for babyface.n26c100, which is 26-connected, but we construct a multigraph (has parallel arcs) with average vertex degree of 49. For some other instances, however, this is not visible, because there could be many zero arcs, e.g. liver.n26c10 which is a 26-connected grid too, but has the average vertex degree of 10.4 with unpaired arcs. The comparison among different methods is correct, since all of them are given exactly the same multigraph.

Table 1 Sequential Competition. CPU—the time spent purely for computation, excluding the time for parsing, construction and disk I/O. The total time to solve the problem is not shown. K—number of regions. RAM—memory taken by the solver; for BK in the case it exceeds 2 GB limit, the expected required memory; for streaming solvers the sum of shared and region memory. I/O—total bytes read or written to the disk

The results are presented in Table 1. We did measure the real time of disk I/O. However, it depends on the hard drive performance, other concurrently running processes as well as on the system file caching (which has effect for small problems). We therefore report total bytes written/loaded and give an estimate of the full running time for the disk speed of 100 MB/s (see Table 2). Note that disk I/O is not proportional to the number of sweeps, because some regions may be inactive during a sweep and thus skipped. For HIPR we do not monitor the memory usage. It is slightly higher than that of HPR, because of keeping initial arc capacities.

Table 2 Estimated running time for the algorithms in the streaming mode, including the time for Disk I/O. The estimate is computed for a disk speed of 100 MB/s and does not include initial problem splitting. The table also gives the total amount of memory used by the solver

For verification of solvers, we compared the flow values to the ground truth solution provided in the dataset. Additionally, we saved the cut output from each solver and checked its cost independently. Verifying the cost of the cut is relatively easy: the cut can be kept in memory and the edges can be processed form the DIMACS problem definition file on-line. An independent check of (pre-)flow feasibility would be necessary for full verification of a solver. However, it would require storing the full graph in the memory and was not implemented.

Our new algorithms computed flow values for all problems matching those provided in the dataset, except for the following cases:

  • LB07-bunny-lrg: no ground truth solution available (we found flow/cut of cost 15537565).

  • babyfacen26c10 and babyfacen26c100: we found higher flow values than those which were provided in the dataset (we found flow/cut of cost 180946 and 1990729 resp.).

The latter problems appear to be the most difficult for S-ARD in terms of both time and number of sweeps. Despite this, S-ARD requires much fewer sweeps, and consequently much less disk I/O operations than the push-relabel variant. This means that in the streaming mode, where read and write operations take a lot of time, S-ARD is clearly superior. Additionally, we observe that the time it spends for computation is comparable to that of BK, sometimes even significantly smaller.

Next, we studied the dependency of computation time and number of sweeps on the number of regions in the partition. We selected 3 representative instances of different problems and varied the number of regions in the partition. The results are presented in the Fig. 13. The instance BL06-gargoyle-sml was partitioned by the vertex index and the remaining two problems were partitioned according to their 3D vertex layout, using variable number of slices in each dimension. The results demonstrate that the computation time required to solve these problems is stable over a large range of partitions and the number of sweeps required does not grow rapidly. Therefore, for the best practical performance the partition for S-ARD can be selected to meet other requirements: memory consumption, number of computation units, etc. We should note however, that with refining the partition the amount of shared memory grows proportionally to the number of boundary edges. In the limit of single-vertex regions, the algorithm will turn into a very inefficient implementation of pure push-relabel.

Fig. 13
figure 21

Dependence on the number of regions for the representative instances of multiview, stereo and segmentation. Top: CPU time used. Bottom: number of sweeps

7.3 Parallel Competition

In this section, we test parallel versions of our algorithms and compare them with two state-of-the-art methods. The experiments are conducted on the same machine as above (Intel Core 2 Quad CPU@2.66 Hz) but allowing the use of all 4 CPUs. The goal is to see how the distributed algorithms perform in the simplified setting when they are run not in the network but on a single machine. For P-ARD/PRD we expect that the total required work would increase compared to the sequential versions because the discharges are executed concurrently. The relative speed-up therefore would be sublinear even if we managed to distribute the work between CPUs evenly. The tests are conducted on small and medium size problems (taking under 2 GB of memory). For P-ARD and P-PRD we use the same partition into regions as in Table 1. For other solvers, discussed next, we tried to meet better their requirements.

DD

The integer dual decomposition algorithm by Strandmark and Kahl (2010)Footnote 7 uses adaptive vertex-wise step rule and randomization. With or without randomization, this algorithm is not guaranteed to terminate. However, while without randomization there is an example with 4 vertices such that the algorithm never terminates, with randomization there is always a chance that it does. Interestingly, on all of the stereo problems the algorithm terminated in a small number of iterations. However, on larger problems partitioned into 4 regions it exceeded the internal iterations bound (1000) in many cases and returned without optimal flow/cut. In such a case it provides only an approximate solution to the problem. Whether such a solution is of practical value is beyond us. We tested it with partitions into 2 and 4 regions (denoted DDx2 and DDx4 resp.). Naturally, with 2 regions the algorithm can utilize only 2 CPUs. When the number of regions is increased, the random event of termination is expected to happen less likely, which is confirmed by our experiments.

RPR

A recently published implementation of Region Push Relabel (Delong and Boykov 2008) by Sameh Khamis (v1.01, http://vision.csd.uwo.ca/code/).

For RPR we constructed partition of the problem into smaller blocks. Because regions in RPR are composed dynamically out of blocks (default is 8 blocks per region) we partitioned 2D problems into 64=82 blocks and 3D problems into 512=83 blocks. This partition was also empirically faster than a coarser one. The parameter DischargesPerBlock was set by recommendation of authors to 500 for small problems (stereo) and to 15000 for big problems. The implementation is specialized for regular grids, therefore multiview and KZ2 problems which do not have regulargrid hint cannot be solved by this method. Because of the fixed graph layout in RPR, arcs which are reverse of each other are automatically grouped together, so RPR computes on a reduced graph compared to other methods. Let us also note that because of the dynamic regions, RPR is not fully suitable to run in a distributed system.

The method of Liu and Sun (2010) (parallel, but not distributed) would probably be the fastest one in this competition (as could be estimated from the results reported by Liu and Sun (2010)), however the implementation is not publicly available.

Results

The results are summarized in Table 3. The time reported is the wall clock time passed in the calculation phase, not including any time for graph construction. The number of sweeps for DD has the same meaning as for P-ARD/PRD, it is the number of times all regions are synchronously processed. RPR however is asynchronous and uses dynamic regions. For it, we define sweeps = block_discharges/number_of_blocks.

Table 3 Parallel Competition

Comparing to Table 1, we see that P-ARD on 4 CPUs is about 1.5–2.5 times faster than S-ARD. The speed-up over BK varies from 0.8 on livern6c10 to more than 4 on gargoyle.

We see that DD gets lucky some times and solves the problem really quickly, but often it fails to terminate. We also observe that our variant of P-PRD (based on highest first selections rule) is a relatively slow, but robust distributed method. RPR, which is based on LIFO selection rule, is competitive on the 3D segmentation problems but is slow on other problems, despite its compile-time optimization for the particular graph structure. It is also uses relatively higher number of blocks, The version we tested always returned the correct flow value but often a wrong (non-optimal) cut. Additionally, for 26 connected bone_subxy.n26c100 it failed to terminated within 1 hour.

7.4 Scalability with Processors

We performed additional tests of P-ARD in the shared memory mode using 1–8 CPUs. This experiment is conducted on a system with Intel(R) Core(TM)i7 CPU 870@2.9 GHz, 16 GB memory, Linux 64 bit and gcc compiler. The plot in Fig. 14 shows the speedup of solving the problem (excluding initialization) using multiple CPUs over the time needed by a single CPU. For this test, we selected medium and large size problems of different kind which can fully fit in 16 GB of memory. The two problems which were taking longer in the serial implementation scaled relatively well. On the other side, the largest LB07-bunny problem did not scale well. We believe that the limiting factor here is the memory bandwidth. We inspected that the sequential part of the computation (boundary relabel heuristic, synchronous message exchange) occupy less than 10 % of the total time for all four problems. The fully parallel part should exhibit a linear speed-up in the ideal case of even load. The load for LB07-bunny should be relatively even, since we have enough regions (64) to be processed with 8 CPUs. Still, there is no speed-up observed in the parallel part of the first sweep (where most of the work is done) when scaling from 4 to 8 CPUs.

Fig. 14
figure 22

Speedup of P-ARD with the number of CPUs used. The extended legend shows the time to solve each problem with 1 and 8 CPUs (does not include initialization). Dashed lines correspond to the speedup in the ideal case (Amdahl’s law) when the parallel portion of the computation is 90 % and 95 %

It is most probable that reducing memory requirements (e.g. by having dedicated graph implementation for regular grids) would also lead to a speed-up of the parallel solver. We also observed that 32 bit compilation (pointers take 32 bits) runs faster than 64 bit compilation. It is likely that our implementation can be optimized further for the best parallel performance. We should however consider that preparing the data for the problem and splitting it into regions is another time-consuming part, which needs to be parallelized.

8 Region Reduction

In this section, we attempt to reduce the region network as much as possible by identifying and eliminating vertices which can be decided optimally regardless of the reminder of the full network outside of the region. If it was possible to decide about many vertices globally optimally inside a region network, the whole problem would simplify a lot. It would require less memory and could be potentially solved without distributing and or partitioned again into larger regions. We propose an improved algorithm for such a reduction and its experimental verification. This preprocessing is studied separately and was not applied in the tests of distributed algorithms above. Experiments with vision problems (Table 4) showed that while 2D problems can be significantly reduced, many of the higher-dimension problems do not allow a substantial reduction.

Table 4 Percentage of vertices which can be decided by preprocessing. The problems are partitioned into regions the same way as in Table 1. For stereo problems the average number over subproblems is shown

Some vertices become disconnected from the sink in the course of the studied algorithms (S/P-ARD, S/P-PRD). If they are still reachable from the source, they must belong to the source set of any optimal cut. Such vertices do not participate in further computations and the problem can be reduced by excluding them. Unfortunately, the opposite case, when a vertex must be strictly in the sink set is not discovered until the very end of the algorithms.

The following algorithm attempts to identify as many vertices as possible for a given region. It is based on the following simple consideration: if a vertex is disconnected from the sink in G R as well as from the region boundary, B R, then it is disconnected from the sink in G; if a vertex is not reachable from the source in G R as well as from B R then it is not reachable from the source in G.

Let us say that a vertex v is a strong source vertex (resp. a strong sink vertex) if for any optimal cut \((C, \bar{C})\), vC (resp. \(v\in \bar{C}\)). Similarly, v will be called a weak source vertex (resp. weak sink vertex), if there exists an optimal cut \((C, \bar{C})\) such that vC (resp. \(v\in \bar{C}\)).

Kovtun (2004) suggested to solve two auxiliary problems, modifying network G R by adding infinite capacity links from the boundary vertices to the sink and in the second problem adding infinite capacity links from the source to the boundary vertices. In the first case, if v is a strong source vertex in the modified network G R, it is also a strong source vertex in G. Similarly, the second auxiliary problem allows to identify strong sink vertices in G. It requires solving a maxflow problem on G R twice. We improve this construction by reformulating it as the following algorithm finding a single flow in G R.

Statement 12

Sets B S and B T constructed in step 2 are disjoint.

Proof

We have st after step 1, hence there cannot exist simultaneously a path from s to v and a path from v to t. □

After step 1, the network G R is split into two disconnected networks: with vertices reachable from s and vertices from which t is reachable. Therefore, any augmentations occurring in steps 4 and 5 act on their respective subnetworks and can be carried independently of each other. On the output of Algorithm 3, we have: sB R∪{t} and B R∪{s}↛t. The classification of vertices is shown in Fig. 15.

Algorithm 3
figure 23

Region Reduction (G R,B R)

Fig. 15
figure 24

Classification of vertices in V R build by Algorithm 3. Vertices reachable from s are strong source vertices. Vertices from which t is reachable are strong sink vertices. The remaining vertices can be classified as weak source (a) if they cannot reach boundary, or as weak sink (b) if they are not reachable from the boundary. Some vertices are both: weak source and weak sink, this means they can be on both sides of an optimal cut (but not independently)

Augmenting on (s,t) in step 1 and on (s,B S) in the step 4 is the same work as done in ARD (where (s,B S) paths are augmented in the order of labels of B S). This is not a coincidence, these algorithms are very much related. However, the augmentation on (B T,t) in step 5 cannot be executed during ARD. It would destroy validity of the labeling. We therefore consider Algorithm 3 as a separate general preprocessing.

If v is a weak source vertex, it follows that it is not a strong sink vertex. In the preflow pushing algorithms, we find the cut \((\bar{T},T)\), where T is the set of all strong sink vertices in G. We consider that v is decided if it is a strong sink or a weak source vertex.

Table 4 gives the percentage of how many vertices are decided (and hence can be excluded from the problem) by Algorithm 3 for computer vision problems. It is seen that in stereo problems, a large percent of vertices is decided. These problems are rather local and potentially can be fully solved by applying Algorithm 3 on several overlapping windows. In contrast, only a small fraction can be decided locally for many other problems.

9 Tightness of O(n 2) Bound for PRD

In this section, we give an example of a network, its partition into regions and a sequence of valid push and relabel operations, implementing PRD, such that S/P-PRD runs in Ω(n 2) sweeps.

We start by an auxiliary example, in which the preflow is transfered from a vertex to a boundary vertex with a higher label. In this example, some inner vertices of a region are relabeled, but not any of the boundary vertices. It will imply that the total number of sweeps cannot be bounded by the number of relabellings of boundary vertices alone.

Example 1

Consider a network of 6 regular vertices in Fig. 16. Assume all edges have infinite capacity, so only non-saturating pushes occur. There are two regions R 1={1,2,3,4,5} and R 2={6}. Figure 16 shows a sequence of valid push and relabel operations. We see that some vertices get risen due to relabel, but the net effect is that flow excess from vertex 1 is transfered to vertex 6, which had a higher label initially. Moreover, none of the boundary vertices (vertices 5,6) are relabeled.

Fig. 16
figure 25

Steps of Example 1. The height of a vertex correspond to its label. The black box shows the vertex with excess in each step. The source and the think vertices are not shown. (ab) flow excess is pushed to vertex 2; (c) vertex 2 is relabeled, so that two pushes are available and excess is pushed to vertex 3; (df) similar

Example 2

Consider the network in Fig. 17. The first step corresponds to a sequence of push and relabel operations (same as in Fig. 16) applied to the chain (1,2a,3a,4a,5,6). Each next step starts with the excess at vertex 1. Chains are selected in turn in the order a,b,c. It can be verified from the figure that each step is a valid possible outcome of PRD applied first to R 1 and then to R 2. The last configuration repeats the first one with all labels raised by 6, so exactly the same loop may be repeated many times.

Fig. 17
figure 26

Steps of Example 2. Top left: a network with several chains of vertices like in Example 1. Nodes 1, 5, 6 are common for all chains but there are separate copies of vertices 2, 3, 4 denoted by letters. In addition, there is a reverse arc from vertex 6 to vertex 1. From left to right, top to bottom: one step of transferring a flow from vertex 1 to vertex 6 using one of the chains and then pushing it through the arc (6,1), relabeling 6 when necessary. The label of the first vertex is increased three times by 2

It is seen that vertices 1,5,6 are relabeled only during pushes in the chains a and b and never during pushes in chain c. If there were more chains like chain c, it would take many iterations (= number of region discharge operations) before boundary vertices are risen. Let there be k additional chains in the graph (denoted d,e,…) handled exactly the same way as chain c. The total number of vertices in the graph is n=3k+const. Therefore, it will take Ω(n) region discharges to complete each loop, raising all vertices by a constant value. The number of discharges needed in order that vertex 1 reaches a label D, is Ω(nD). To make the example complete, we add a chain of vertices initially having labels 1,2,3,…,D to the graph such that there is a path from vertex 1 to the sink through a vertex with label D. Clearly, we can arrange that D=Ω(n). The algorithm needs Ω(n 2) discharges on this example.

Because there is only one active vertex at any time, the example is independent of the rule used to select the active vertex (highest label, FIFO, etc.). By the same reason, it also applies to parallel PRD. Because the number of regions is constant, the number of sweeps required is also Ω(n 2).

For comparison, noting that the number of boundary vertices is 3, we see that S-ARD algorithm will terminate on this example in a constant number of sweeps for arbitrary k.

10 Relation to Dual Decomposition

In our approach, we partition the set of vertices into regions and couple the regions by sending the flow through the inter-region edges. In the dual decomposition for mincut (Strandmark and Kahl 2010) detailed below, a separator set of the graph is selected, each subproblem gets a copy of the separator set and the coupling is achieved via the constraint that the cut of the separator set must be consistent across the copies. We now show how the dual variables of Strandmark and Kahl (2010) can be interpreted as flow, thus relating their approach to ours.

Decomposition of the mincut problem into two parts is formulated by Strandmark and Kahl (2010) as follows. Let M,NV are such that MN=V, {s,t}⊂MN and there are no edges in E from MN to NM and vice-versa. Let x:M→{0,1} and y:N→{0,1} be the indicator variables of the cut set, where 0 corresponds to the source set. Then the mincut problem without excess can be reformulated as:

(15)

where

(16a)
(16b)
(17a)
(17b)
(17c)

\(E^{M} = (M,M) \buildrel \text {d{}ef}\over =(M\times M)\cap E\) and E N=(N,N). The minimization over x and y decouples once the constraint x i =y i is absent. The dual decomposition approach is to solve the dual problem:

(18)

where the dual variable λ is multiplied by the extra terms (1−x s )=(1−y s )=1 to show explicitly that the inner minimization problems are instances of the minimum cut problem.

We observe that dual variables λ correspond to the flow on the artificial edges of infinite capacity between the copies of the vertices of the separator set as illustrated by Fig. 18. Indeed, consider a vertex v in the separator set. The dual variable λ v contributes to the increase of the terminal link (v′,t) in the subproblem M and to the decrease of the terminal link (v″,t) in the subproblem N. This can be equivalently represented as an augmentation of flow of λ v on the cycle v′,t′,v″ in the network Fig. 18(c). The optimal flow in the network Fig. 18(c) on the constraint edges will therefore correspond to the optimal λ. This construction could be easily extended to the case when a vertex v from the separator set is shared by more than two subproblems.

Fig. 18
figure 27

Interpretation of the dual decomposition. (a) Example of a network with denoted capacities. Terminal capacities are shown in circles, where “+” denotes s-link and “−−” denotes t-link. MN is a separator set. (b) The network is decomposed into two networks holding copies of the separator set. The associated capacities are divided (not necessarily evenly) between two copies. The variable λ 1 is the Lagrangian multiplier of the constraint x v =y v . (c) Introducing edges of infinite capacity enforces the same constraint, that v′ and v″ are necessarily in the same cut set of any optimal cut. (d) A maximum flow in the network (c), the flow value on the red edges corresponds to the optimal value of the dual variables λ

There exist an integer optimal flow for a problem with integer capacities. This observation provides an alternative proof of the theorem (Strandmark and Kahl 2010, Theorem 2),Footnote 8 stating that there exist an integer optimal λ. Despite the existence of an integer solution, the integer subgradient algorithm (Strandmark and Kahl 2010) is not guaranteed to find it.

The algorithm we introduced could be applied to such a decomposition by running it on the extended graph Fig. 18(c), where vertices of the separator set are duplicated and linked by additional edges of infinite capacity. It could be observed, however, that this construction does not allow to reduce the number of boundary vertices or the number of inter-region edges, while the size of the regions increases. Therefore it is not beneficial with our approach.

11 Conclusion

We developed a new algorithm for mincut problem on sparse graphs, which combines augmenting paths and push-relabel approaches. We proved the worst case complexity guarantee of \(O(|\mathcal {B}|^{2})\) sweeps for the sequential and parallel variants of the algorithm (S/P-ARD). While there are many algorithms in the literature with complexities in terms of elementary arithmetic operations better than we could possibly prove, we showed that our algorithms are fast and competitive in practice, even in the shared memory model. We proposed an improved algorithm for local problem reduction (Sect. 8) and determined that most of our test instances are difficult enough in the sense that very few vertices can be decided optimally by looking at individual regions. The result that S/P-ARD solves test problem in few tens of sweeps is thus non-trivial. We also gave a novel parallel version of the region push-relabel algorithm of Delong and Boykov (2008) and a number of auxiliary results, relating our approach to the state-of-the.

Both in theory and practice (randomized test), S-ARD has a better asymptote in the number of sweeps than the push-relabel variant. Experiments on real instances showed that when run on a single CPU and the whole problem fits into the memory, S-ARD is comparable in speed with the non-distributed BK implementation, and is even significantly faster in some cases. When only a single region is loaded into memory at a time, S-ARD uses much fewer disk I/O than S-PRD. We also demonstrated that the running time and the number of sweeps are very stable with respect to the partition of the problem into up to 64 regions. In the parallel mode, using 4 CPUs, P-ARD achieves a relative speedup of about 1.5–2.5 times over S-ARD and uses just slightly larger number of sweeps. P-ARD compares favorably to other parallel algorithms, being a robust method suitable for a use in a distributed system.

Our algorithms are implemented for generic graphs. Clearly, it is possible to specialize the implementation for grid graphs, which would reduce the memory consumption and might reduce the computation time as well.

A practically useful mode could be actually a combination of a parallel and sequential processing, when several regions are loaded into the memory at once and processed in parallel. There are several particularly interesting combinations of algorithm parallelization and hardware, which may be exploited: (1) parallel on several CPUs, (2) parallel on several network computers, (3) sequential, using Solid State Drive, (4) sequential, using GPU for solving region discharge.

There is the following simple way how to allow region overlaps in our framework. Consider a sequential algorithm, which is allowed to keep 2 regions in memory at a time. It can then load pairs of regions (1,2),(2,3),(3,4),… , and alternate between the regions in a pair until both are discharged. With PRD, this is efficiently equivalent to discharging twice larger regions with a 1/2 overlap and may significantly decrease the number of sweeps required. In the case of a 3D grid, it would take 8 times more regions to allow overlaps in all dimensions. However, to meet the same memory limit, the regions have to be 8 times smaller. It has to be verified experimentally whether it is beneficial. In fact, the RPR implementation of Delong and Boykov (2008) uses exactly this strategy: a dynamic region is composed out of a number of smaller blocks and blocks are discharged until the whole region is not discharged. It is likely that with this approach we could further reduce the disk I/O in the case of the streaming solver.

Future Work

We plan to provide also a distributed MPI-based implementation as well as address the question of dynamic updates of the problem and implement a more efficient input-output interface for the use in real applications. The memory-efficient representation of graphs having repetitive structure is also possible.