
1 Introduction

Dynamic programming, formally introduced by Richard Bellman (August 26, 1920–March 19, 1984) at the Rand Corporation in Santa Monica in the 1940s, is a versatile method to construct efficient algorithms for a broad range of problems. As with many tools which evolved over time, the original paper sounds antiquated. In his monograph “Dynamic Programming,” Bellman [25] describes the principle of optimality, which is central to dynamic programming: “An optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.” Not exactly what is found in textbooks today. But then in those days the focus was more on stochastic processes and not so much on combinatorial optimization. The term “programming” is outdated as well, as it does not refer to programming in a modern programming language. Instead, programming means planning by filling in tables. The term “linear programming” derives in the same way. Over the decades, dynamic programming has evolved and is now one of the “killer” techniques in algorithmic design. And, of course, the laptop computer on which this chapter was written is more powerful than anyone in Bellman’s days could have imagined.

Today, the emphasis is on how to organize dynamic programming in a way that makes it possible to solve massive problems (which – again – would have never been considered in Bellman’s days) in reasonable time. The focus is on dynamic programming speedup. Such speedup comes from carefully observing which values are essential and need to be computed and which are unnecessary. Keeping proper look-up tables on the side can sometimes accomplish this. But there are many situations with inherent monotonicity properties which can be exploited to calculate only a fraction of the values a simple-minded approach would compute.

The goal of this chapter is to briefly introduce dynamic programming, show some of the diversity of problems that can be tackled efficiently with dynamic programming, and – centrally – focus on the issue of dynamic programming speedup. Clearly, the chapter does not cover all of dynamic programming, but there is a section on recommended reading, Sect. 8, which covers some ground.

The chapter is organized as follows: Sect. 2 starts with a simple introduction to dynamic programming using a few standard examples, described more tersely than in a textbook and augmented with a few general themes. A reader altogether unfamiliar with dynamic programming might utilize some of the resources given in Sect. 8.

Section 3 contains “open-ended dynamic programming,” where a process is updated continuously. This is closely related to the theory of online optimization. In online computation, an algorithm must make decisions without knowledge of future inputs. Online problems occur naturally across the entire field of discrete optimization, with applications in areas such as robotics, resource allocation in operating systems, and network routing. Dynamic programming plays an important role in this kind of optimization. Notably, work functions replace tables or matrices used in offline optimization.

Section 4 contains an example from sorting. The intent of this section is twofold: First, lesser-known dynamic programming techniques for sorting are highlighted. Second, it gives an example where a straightforward implementation might give an algorithm of cubic complexity or worse, while a more intricate setup solves the problem in quadratic time, thus achieving substantial dynamic programming speedup.

Section 5 gives an example from scheduling (a batching problem) to illustrate important properties for dynamic programming speedup: the Monge property and total monotonicity. Techniques exploiting these properties are now standard, and the reader might consult Sect. 8 to see how prolific such techniques are.

Speedup is based on two important and intricate algorithms: SMAWK and LARSCH. Use of these is essential for many fast dynamic programming algorithms. However, these algorithms are currently only accessible through the original publications. Section 6 contains “ready to implement” descriptions of the SMAWK and LARSCH algorithms.

Another type of speedup is based on the Knuth-Yao quadrangle inequality; this dynamic programming speedup works for a large class of problems. Even though both the SMAWK algorithm and the Knuth-Yao (KY) speedup use an implicit quadrangle inequality in their associated matrices, on second glance, they seem quite different from each other. Section 7 discusses the relation between these kinds of speedup in greater detail.

Finally, the chapter closes with a cross-reference list relating it to other chapters and with concluding remarks.

2 A Few Standard Introductory Examples

As mentioned earlier, this section has a few introductory examples.

2.1 A Brief Introduction to Dynamic Programming: Fibonacci, Pascal, and Making Change

Many problems can be solved recursively by dividing an instance into subinstances. But a direct implementation of a recursion is often inefficient because subproblems overlap and are recomputed numerous times. The calculation of the Fibonacci numbers provides a simple example:

$$ f_{n} = \left\{\begin{array}{ll} f_{n-1} + f_{n-2} & \mbox{ if } n > 1 \\ 1 & \mbox{ if } n = 1 \\ 0 & \mbox{ if } n = 0. \end{array} \right. $$
(1)

Recursively, the Fibonacci numbers can be calculated in the following way:

$$\displaystyle\begin{array}{rcl} \mathbf{function}& & fib(n) \\ & & \mathbf{if}\ n = 0\,\mathrm{or}\,n = 1\ \mathbf{return}\ n \\ & & \mathbf{else}\ \mathbf{return}\ fib(n - 1)\, +\, fib(n - 2)\end{array}$$

This is extremely inefficient; many values will be recalculated; see Fig. 1. Instead, one could use an array and simply fill in values from “left to right,” starting with \(f_{0}\) and \(f_{1}\) and then using Eq. 1 to continue. This way every value is only calculated once.

Fig. 1 Recursive calculation of the fifth Fibonacci number
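As a minimal sketch (in Python), the left-to-right computation can be written so that only the last two values are kept:

```python
def fib(n):
    """Bottom-up evaluation of Eq. 1: every Fibonacci number is computed
    exactly once, and only the last two values are stored."""
    if n <= 1:
        return n
    prev, cur = 0, 1          # f_0 and f_1
    for _ in range(2, n + 1):
        prev, cur = cur, prev + cur
    return cur

print(fib(5))   # 5, the value whose recursive computation is shown in Fig. 1
```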

The Pascal triangle for calculating binomial coefficients provides a good example for the use of tables in dynamic programming. Recall the definition of the binomial coefficient:

$$\left (\begin{array}{l} n\\ k \end{array} \right ) = \left \{\begin{array}{ll} \left (\begin{array}{l} n - 1\\ k - 1 \end{array} \right ) + \left (\begin{array}{l} n - 1\\ k \end{array} \right )&\mbox{ if }0 < k < n\\ 1 &\mbox{ if }k = 0\mbox{ or }k = n\\ 0 &\mbox{ else}. \end{array}\right.$$
(2)

Clearly, the coefficients can be calculated using the following recursive program:

$$ \begin{array}{ll} \mathbf{function} & c(n,k) \\ & \mathbf{if}\ k = 0\ \mathrm{or}\ k = n\ \mathbf{return}\ 1 \\ & \mathbf{else}\ \mathbf{return}\ c(n - 1, k - 1) + c(n - 1,k) \end{array} $$

But again this is inefficient, as terms are recalculated over and over. Indeed, since the solution is ultimately made up of terms of value 1, the run time of this algorithm is \(\Omega (\left ({ n \atop k} \right ))\). The problem consists of overlapping subproblems, which makes a direct use of the recursion inefficient. Instead, one can calculate the coefficients bottom up using a table to store intermediate results. In the case of the binomial coefficient, this table is the Pascal triangle (Fig. 2). With it the run time is \(\Theta (nk)\) with space requirement \(\Theta (k)\) (as only two rows at a time need to be stored).

Fig. 2 The Pascal triangle
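A possible bottom-up implementation of Eq. 2 that stores only a single row of the Pascal triangle, and hence uses \(\Theta (k)\) space, is sketched here:

```python
def binomial(n, k):
    """Pascal-triangle computation of Eq. 2 in Theta(nk) time and Theta(k) space."""
    if k < 0 or k > n:
        return 0
    row = [1] + [0] * k        # row 0 of the triangle, truncated after k + 1 entries
    for _ in range(n):
        # update right to left so that entries of the previous row are still available
        for j in range(k, 0, -1):
            row[j] += row[j - 1]
    return row[k]

print(binomial(6, 2))   # 15
```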

Most often dynamic programming is used for optimization problems. As an example consider the problem of making change with the minimum number of coins. Given are n denominations of values \(\{d_{1},\ldots,d_{n}\}\) and an unlimited supply of coins of each denomination; the problem is to make change for an amount A. For example, if the denominations are 1, 5, 10, 12, and A = 21, then the minimum number of coins is 3. Note that the greedy algorithm uses 6 coins. What is important here is that the principle of optimality holds: If optimal change for amount C involves making change for amounts A and B with C = A + B, then the change for A and B is also optimal. More generally, the principle states that in an optimal sequence of decisions or choices, each subsequence must also be optimal.

Let

$$\displaystyle\begin{array}{rcl} C[i,j]& =& \mbox{ minimum number of coins required to make change for amount }j \\ & & \mbox{ using only coins of denomination }\{1,\ldots,i\}, \\ \end{array}$$

where \(i = 1,\ldots n\) and \(j = 1,\ldots A\). The principle of optimality implies the following recurrence:

$$ C [i,j] = \min\{ C[i - 1,j],1 + C[i,j - d_{i}]\},$$
(3)

where out-of-bounds values are set to \(\infty \).

Just as before with Fibonacci and Pascal, one uses a table to calculate values “bottom-up.” The table is initialized by setting all C[i, 0] to 0. Then the table is filled row by row using Eq. 3. An example is in Table 1.

Table 1 An instance of the change-making problem: There are four types of coins with denominations \(d_{1} = 1\), \(d_{2} = 5\), \(d_{3} = 10\), and \(d_{4} = 12\). The amount is A = 21. The last entry in the table gives the minimum number of coins, which is 3. Note that the greedy algorithm uses six coins
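A straightforward implementation of the recurrence in Eq. 3 might look as follows; the call at the end reproduces the instance of Table 1.

```python
def min_coins(denoms, amount):
    """Change-making recurrence of Eq. 3: C[i][j] is the minimum number of coins
    for amount j using only the first i denominations; out-of-bounds values
    are treated as infinity."""
    n = len(denoms)
    INF = float('inf')
    C = [[0] * (amount + 1) for _ in range(n + 1)]
    for j in range(1, amount + 1):
        C[0][j] = INF                      # no denomination available yet
    for i in range(1, n + 1):
        d = denoms[i - 1]
        for j in range(1, amount + 1):
            C[i][j] = C[i - 1][j]                          # do not use coin i
            if j >= d and 1 + C[i][j - d] < C[i][j]:
                C[i][j] = 1 + C[i][j - d]                  # use one more coin i
    return C[n][amount]

print(min_coins([1, 5, 10, 12], 21))   # 3
```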

2.2 Chained Matrix Multiplication Problem

Given is a sequence of matrices \(M_{1},\ldots,M_{n}\) to be multiplied. The dimensions are \(d_{0} \times d_{1},\ldots,d_{n-1} \times d_{n}\), denoted by \((d_{0},d_{1},\ldots,d_{n})\) for short. The question is what parenthesization gives the minimum number of multiplications. For example, given three matrices, A, B, C, with dimensions (12, 5, 90, 2), the calculation A(BC) requires \(5 \times 90 \times 2\) multiplications to produce BC and then \(12 \times 5 \times 2\) to multiply the result by A, for a total of 1,020. The order (AB)C is much worse: 7,560 multiplications. For general n, exhaustive search is prohibitive for the following reason: Let

$$T(n) = \mbox{ number of ways to parenthesize a product of }n\mbox{ matrices},$$

then

$$T(n) =\displaystyle\sum _{ i=1}^{n-1}T(i)T(n - i)$$
(4)

with \(T(2) = T(1) = 1\). The solution to recurrence 4 is

$$T(n) = \frac{1} {n}\left (\begin{array}{c} 2n - 2\\ n - 1 \end{array} \right ),$$
(5)

which is \(\Omega ({4}^{n}/{n}^{2})\).

Clearly the principle of optimality applies. If the parenthesization is optimal for \(M = M_{1},\ldots,M_{n}\) and there are two parts \(A = M_{1},\ldots,M_{k}\) and \(B = M_{k+1},\ldots,M_{n}\) such that M = AB, then the parenthesizations for A and B are optimal. Let

$$M[i,j] = \mbox{ optimal number of multiplications to compute }M_{i}\cdots M_{j}.$$

Thus, the recurrence is

$$M[i,j] =\min\limits_{i\leq k<j}\{M[i,k] + M[k + 1,j] + d_{i-1}d_{k}d_{j}\}\mbox{ for }1 \leq i < j \leq n,$$
(6)

with M[i, i] = 0.

The dynamic program fills a table with entries for \(i \leq j\), starting with the main diagonal, which is set to M[i, i] = 0. The computation then progresses to M[i, i + 1] and so forth until the element in the north-east corner M[1, n] is reached. In order to be able to recover the solution, the index k which gives the minimum in Eq. 6 is also stored. When the algorithm reaches (1, n), this index gives the first cut. One then proceeds recursively on both sides to construct the actual parenthesization. Figure 3 gives an example.

Fig. 3 An example for the dynamic program for the problem “chained matrix multiplication.” In the example, n = 4 and the dimensions are (12, 5, 90, 2, 35). The calculation proceeds by first filling the table for indices with j − i = 0 (i.e., initialization of M[i, i]) and then continues along the diagonals j − i = 1, 2, 3. Calculations which give the minima for each cell in Eq. 6 are shown in bold type, and the corresponding index k is stored. Once the value of M[1, 4] is known, then the solution can be recovered using these indices: \((M_{1}M_{2}M_{3})M_{4}\) can be concluded from the index k = 3 in cell (1, 4). Next, look up cell (1, 3) to find the parenthesization \(M_{1}(M_{2}M_{3})\)
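A sketch of the recurrence in Eq. 6, with the minimizing split indices stored for reconstructing the parenthesization, could look as follows; the call at the end uses the dimensions of the example in Fig. 3.

```python
def matrix_chain(dims):
    """Recurrence of Eq. 6: dims = (d_0, d_1, ..., d_n); returns the minimum number
    of multiplications and the table of optimal split indices k."""
    n = len(dims) - 1
    INF = float('inf')
    M = [[0] * (n + 1) for _ in range(n + 1)]
    split = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):                 # length of the chain M_i ... M_j
        for i in range(1, n - length + 2):
            j = i + length - 1
            M[i][j] = INF
            for k in range(i, j):
                cost = M[i][k] + M[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                if cost < M[i][j]:
                    M[i][j], split[i][j] = cost, k
    return M[1][n], split

best, split = matrix_chain((12, 5, 90, 2, 35))
print(best)
```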

2.3 Shortest Paths

Consider a directed graph on nodes \(V =\{1,\ldots,n\}\) with distance matrix \(D[i,j] \geq \) 0, where \(D[i,j] = \infty \) if there is no edge between node i and node j. The object is to calculate the shortest path between each pair of nodes. To use dynamic programming, one defines

$$D_{k}[i,j] = \mbox{ length of a shortest path from $i$ to $j$ which only uses nodes $\{1,\ldots,k\}$}.$$

Clearly \(D_{0} = D\). Starting with \(D_{0}\) one calculates \(D_{1},D_{2},\ldots,D_{n}\) using the following recursion:

$$D_{k}[i,j] =\min\{ D_{k-1}[i,j],D_{k-1}[i,k] + D_{k-1}[k,j]\}.$$
(7)

The algorithm, which computes this series of tables, is called the Floyd algorithm. The reader is invited to follow the calculations in Table 2, which gives an example for the graph in Fig. 4. The principle of optimality is at work here: If a path from node i to node j is optimal and passes through node k, then the path from i to k, as well as the path from k to j, is optimal. This is not true for the longest simple path problem. (A simple path is a path without repeated nodes.) In the graph of Fig. 4, the longest simple path from node 2 to node 4 is 2, 1, 3, 4, but the path 2, 1 is not a longest simple path from 2 to 1. Indeed the problem “longest simple path” is \(\mathcal{N}\mathcal{P}\)-hard. This does not contradict the fact that the Floyd algorithm works for negative distances if there are no negative cycles. (Longest simple path cannot be reduced to shortest path with negative distances because of the cycle restriction.) The Floyd algorithm can be used to detect negative cycles by checking for negative elements in the main diagonal of \(D_{n}\).

Table 2 The Floyd algorithm for the example of Fig. 4. One proceeds from matrix \(D_{k-1}\) to matrix \(D_{k}\) by applying the rule \(D_{k}[i,j] =\min\{ D_{k-1}[i,j],D_{k-1}[i,k] + D_{k-1}[k,j]\}\) for all i and j. In the lower right corner of the table, the matrix of pointers P is shown. For example, the length of a shortest path from node 3 to node 2 is 4 since \(D_{4}[3,2] = 4\). To construct the path look up P[3, 2] = 4 to find that the path goes through node 4. Recursively, look up P[3, 4] = P[4, 2] = 0. Thus the shortest path is 3, 4, 2
Fig. 4 A directed graph with distances

The time complexity of the Floyd algorithm is \(O({n}^{3})\). The calculation can be performed by keeping only \(D_{k-1}\) and \(D_{k}\) at each iteration, and thus the space requirement is \(O({n}^{2})\). In order to retrieve the actual path (and not only its length), a matrix P of pointers is used, where P[i, j] contains the number of the last iteration k that causes a change in \(D_{k}[i,j]\) (P[i, j] is initialized to 0). To construct the path between nodes i and j, look up P[i, j] at the end. If P[i, j] = 0, then there was never a change for any \(D_{k}\) and the shortest path is the edge (i, j); else the shortest path goes through k. Recursively examine P[i, k] and P[k, j].
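A compact sketch of the Floyd algorithm together with the pointer matrix P (0-indexed here, with None playing the role of P[i, j] = 0) is the following.

```python
def floyd(D):
    """Floyd algorithm: D is an n x n distance matrix (D[i][i] = 0, float('inf')
    for missing edges); returns shortest distances and the pointer matrix P."""
    n = len(D)
    dist = [row[:] for row in D]
    P = [[None] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    P[i][j] = k            # last iteration k that changed (i, j)
    return dist, P

def path(P, i, j):
    """Expand the pointers recursively into the actual shortest path from i to j."""
    k = P[i][j]
    return [i, j] if k is None else path(P, i, k) + path(P, k, j)[1:]
```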

The resulting algorithm is named after Floyd. (It appeared as a one-page note in the Communications of the ACM – together with other algorithms of the time – under the now antiquated title “Algorithm 97: Shortest Path”; Floyd [55].)

Another algorithm of the time is the matrix algorithm, originally given in the context of the transitive closures by Warshall. The paper is a short two-pager; see Warshall [92].

Define

$${D}^{(k)}[i,j] = \mbox{ length of a shortest path from }i\mbox{ to }j\mbox{ containing at most }k\mbox{ edges}$$

and set

$${D}^{(0)}[i,j] = \left \{\begin{array}{ll} 0 &\mbox{ if }i = j\\ \infty &\mbox{ else}. \end{array} \right.$$

Clearly by the principle of optimality,

$$\displaystyle\begin{array}{rcl}{ D}^{(k)}[i,j]& =& \min\{{D}^{(k-1)}[i,j],\min\limits_{1\leq \ell\leq n,\ell\neq j}\{{D}^{(k-1)}[i,\ell] + D[\ell,j]\}\}\end{array}$$
(8)
$$\displaystyle\begin{array}{rcl} & =& \min\limits_{1\leq \ell\leq n}\{{D}^{(k-1)}[i,\ell] + D[\ell,j]\}.\end{array}$$
(9)

The previous equation defines a matrix multiplication in which the usual multiplication “\(\cdot\)” is replaced by “+” and the usual addition “+” is replaced by “\(\min\).” Table 3 gives an example. In other words,

$${D}^{(k)} = {D}^{k}.$$

The solution to the shortest path problem is therefore obtained by calculating \({D}^{n-1}\) with the operators properly replaced. The run time of this algorithm is at first glance \(O({n}^{4})\), but it can be improved by “repeated squaring”: Calculate

  • \({D}^{2} = D \cdot D\)

  • then \({D}^{4} = {D}^{2} \cdot {D}^{2}\)

  • then \({D}^{8} = {D}^{4} \cdot {D}^{4}\), and so forth

Table 3 Example of the progression of the matrix algorithm for the all-pairs shortest path problem given in Fig. 4. The operations in the “matrix multiplication” are not the usual “ + ” and “\(\cdot \)” but rather “\(\min\)” and “ + ”

If n − 1 is not a power of 2, then \({D}^{n-1}\) can be obtained by going up to the highest power of 2 smaller than (n − 1) and multiplying the appropriate powers from the binary representation of (n − 1). As a result the number of matrix multiplications is only logarithmic, giving a run time of \(O({n}^{3}\log n)\).
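The following sketch implements the “matrix multiplication” of Eq. 9 in the (min, +) sense, together with a slightly simplified squaring loop: instead of assembling the binary representation of n − 1, it keeps squaring until the exponent reaches n − 1, which yields the same matrix provided the diagonal of D is zero and there are no negative cycles.

```python
def min_plus(A, B):
    """One 'matrix multiplication' with '+' replaced by 'min' and '.' by '+' (Eq. 9)."""
    n = len(A)
    return [[min(A[i][l] + B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def all_pairs_by_squaring(D):
    """Repeated squaring of D in the (min, +) semiring; assumes D[i][i] = 0 and
    no negative cycles, so that D^k = D^(n-1) for every k >= n - 1."""
    n = len(D)
    result, power = D, 1
    while power < n - 1:
        result = min_plus(result, result)
        power *= 2
    return result
```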

2.4 The Knapsack Problem

This is a classical problem in combinatorial optimization: Given are n items \(\{1,\ldots,n\}\) with weights \(w_{i} > 0\) and profits \(p_{i} > 0\), and there is a knapsack of weight capacity W > 0. One is to fill the knapsack in such a way that the profit of the chosen items is maximized while the total weight remains less than W. Let

$$\displaystyle\begin{array}{rcl} P[i,j]& =& \mbox{ maximum profit, which can be obtained if the} \\ & \ & \mbox{ weight limit is }j\mbox{ and only items from}\{1,\ldots i\}\mbox{ may be included,} \\ \end{array}$$

where \(1 \leq i \leq n\) and \(0 \leq j \leq W\). The following recursion is valid:

$$P[i,j] =\max \{ P[i - 1,j],\,P[i - 1,j - w_{i}] + p_{i}\},$$
(10)

where P[0, j] = 0 and \(P[i,j] = -\infty \) when j < 0. Equation 10 says that a new element i is either not included (left of the \(\max\)) or included (right of the \(\max\)). If one includes the element, then by the principle of optimality the solution for the earlier elements \(\{1,\ldots,i - 1\}\) with the capacity reduced by the weight of element i must be optimal.

The table P[i, j] can be filled in either row-by-row or column-by-column fashion. The run time of this algorithm is \(\Theta (nW)\). Note that this is not a polynomial algorithm for the knapsack problem. The input size is \(O(n\log \max _{i=1,\ldots,n}w_{i} +\log W)\), which means that \(\Theta (nW)\) is exponential in the input size. This is, of course, to be expected as the knapsack problem is \(\mathcal{N}\mathcal{P}\)-hard. The algorithm’s complexity is called pseudo-polynomial. Many dynamic programs give pseudo-polynomial algorithms.
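A minimal sketch of the recurrence in Eq. 10, reading the weight limit j as an upper bound on the total weight, is the following.

```python
def knapsack(weights, profits, W):
    """Recurrence of Eq. 10: P[i][j] is the maximum profit using items 1..i
    with weight limit j; run time Theta(nW)."""
    n = len(weights)
    P = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, p = weights[i - 1], profits[i - 1]
        for j in range(W + 1):
            P[i][j] = P[i - 1][j]                          # item i not included
            if j >= w and P[i - 1][j - w] + p > P[i][j]:
                P[i][j] = P[i - 1][j - w] + p              # item i included
    return P[n][W]
```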

Pseudo-polynomial, Strongly Polynomial. Let s be the input to some decision problem \(\Pi \). Let \(\vert s\vert _{\log }\) be the length of the input – that is, the length of the binary encoding – and let \(\vert s\vert _{\max }\) be the magnitude of the largest number in s. A problem \(\Pi \) is pseudo-polynomially solvable if there is an algorithm for \(\Pi \) with run time bounded by a polynomial function in \(\vert s\vert _{\max }\) and \(\vert s\vert _{\log }\). A decision problem is a number problem if there exists no polynomial p such that \(\vert s\vert _{\max }\) is bounded by \(p(\vert s\vert _{\log })\) for all inputs s. (For example, the decision version of the knapsack problem is a number problem.) Note that by these definitions it immediately follows that an \(\mathcal{N}\mathcal{P}\)-complete non-number problem (such as HAMILTONIAN CIRCUIT, for example) cannot be solved by a pseudo-polynomial algorithm (unless \(\mathcal{P} =\) \(\mathcal{N}\mathcal{P}\)). For a polynomial p let \(\Pi _{p}\) denote the subproblem which is obtained by restricting \(\Pi \) to instances with \(\vert s\vert _{\max }\leq p(\vert s\vert _{\log })\). Problem \(\Pi \) is called strongly \(\mathcal{N}\mathcal{P}\) -complete if \(\Pi \) is in \(\mathcal{N}\mathcal{P}\) and there exists a polynomial p for which \(\Pi _{p}\) is \(\mathcal{N}\mathcal{P}\)-complete. It is easy to see (Table 4):

Theorem 1

A strongly \(\mathcal{N}\mathcal{P}\) -complete problem cannot have a pseudo-polynomial algorithm unless \(\mathcal{P} =\) \(\mathcal{N}\mathcal{P}\).

2.5 Binary Search Trees

The construction of optimal binary search trees is a classic optimization problem. One is interested in constructing a search tree in which elements can be looked up as quickly as possible. The first dynamic program for this problem was given by Gilbert and Moore [59] in the 1950s. More formally, given are n search keys with known order \(\mbox{ Key}_{1} < \mbox{ Key}_{2} < \cdots < \mbox{ Key}_{n}\). The input consists of 2n + 1 probabilities \(p_{1},\ldots,p_{n}\) and \(q_{0},q_{1},\ldots,q_{n}\). The value of \(p_{l}\) is the probability that a search is for the value of \(\mbox{ Key}_{l}\); such a search is called successful. The value of \(q_{l}\) is the probability that a search is for a value between \(\mbox{ Key}_{l}\) and \(\mbox{ Key}_{l+1}\) (set \(\mbox{ Key}_{0} = -\infty \) and \(\mbox{ Key}_{n+1} = \infty \)); such a search is called unsuccessful. Note that in the literature the problem is sometimes presented with weights instead of probabilities; in that case the \(p_{l}\) and \(q_{l}\) are not required to add up to 1.

Table 4 An example of the pseudo-polynomial dynamic programming algorithm for the knapsack problem on nine items. The weights are w[1] = 1, w[2] = 2, w[3] = 3, w[4] = 4, w[5] = 5, w[6] = 6, w[7] = 7, w[8] = 8, w[9] = 9 and the profits are p[1] = 1, p[2] = 2, p[3] = 5, p[4] = 10, p[5] = 15, p[6] = 16, p[7] = 21, p[8] = 22, p[9] = 35. The size of the knapsack is 15. The table shows in position (i, j) the maximum profit which can be obtained if the weight limit is j and only items from {1, …, i} may be included. Bold entries indicate that in Eq. 10 the second choice is the maximizer, i.e., the new item is considered for inclusion. From the regular/bold information the actual solution can be reconstructed: The last entry in the table is 50. Because of the bold typeface, item i = 9 is included. Thus one looks \(w_{9} = 9\) many cells to the left in the previous row. The regular typeface of 16 in cell (7, 6) means that item 8 is not included; looking at the cells above, neither are items 7 and 6. Item 5 is included; next, look up cell (4, 1) to see that no more items are included. The solution is {5, 9} with a profit of 50

The binary search tree constructed will have n internal nodes corresponding to the successful searches, and n + 1 leaves corresponding to the unsuccessful searches. The depth of a node is the number of edges from the node to the root. Denote by \(d(p_{l})\) the depth of the internal node corresponding to \(p_{l}\) and by \(d(q_{l})\) the depth of the leaf corresponding to \(q_{l}\). A successful search requires \(1 + d(p_{l})\) comparisons, and an unsuccessful search requires \(d(q_{l})\) comparisons. See Fig. 5. So, the expected number of comparisons is

$$\displaystyle\sum _{1\leq l\leq n}p_{l}\,(1 + d(p_{l})) +\displaystyle\sum _{0\leq l\leq n}q_{l}\,d(q_{l}).$$
(11)

The goal is to construct an optimal binary search tree that minimizes the expected number of comparisons carried out, which is (11).

Fig. 5 A binary search tree with successful (round) and unsuccessful (rectangular) nodes

Let \(B_{i,j}\) be the expected number of comparisons carried out in an optimal subtree containing the keys \(\mbox{ Key}_{i+1} < \mbox{ Key}_{i+2} < \cdots < \mbox{ Key}_{j}\). Observing that in a search the probability to search in the region between \(\mbox{ Key}_{i+1}\) and \(\mbox{ Key}_{j}\) is \(\sum _{l=i+1}^{j}p_{l} +\sum _{ l=i}^{j}q_{l}\), it is clear that the following recurrence holds:

$$B_{i,j} = \left \{\begin{array}{ll} 0, &\mbox{ if }i = j; \\ \displaystyle\sum _{l=i+1}^{j}p_{ l} +\displaystyle\sum _{ l=i}^{j}q_{ l} +\min\limits_{i<t\leq j}\left (B_{i,t-1} + B_{t,j}\right ),&\mbox{ if }i < j, \end{array} \right.$$
(12)

where the cost of the optimal binary search tree is \(B_{0,n}\). Calculating \(B_{i,j}\) requires O(j − i) time, thus calculating all of the \(B_{i,j}\) requires \(O({n}^{3})\) time.
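The recurrence in Eq. 12 might be implemented as in the following sketch; the probability sums are accumulated beforehand so that each entry costs O(j − i) time.

```python
def optimal_bst(p, q):
    """Recurrence of Eq. 12: p[1..n] are the probabilities of successful searches,
    q[0..n] of unsuccessful ones (p[0] is unused); returns B[0][n], the minimum
    expected number of comparisons, in O(n^3) time."""
    n = len(p) - 1
    B = [[0.0] * (n + 1) for _ in range(n + 1)]
    W = [[0.0] * (n + 1) for _ in range(n + 1)]     # W[i][j] = sum p_{i+1..j} + sum q_{i..j}
    for i in range(n + 1):
        W[i][i] = q[i]
        for j in range(i + 1, n + 1):
            W[i][j] = W[i][j - 1] + p[j] + q[j]
    for length in range(1, n + 1):                  # length = j - i
        for i in range(n - length + 1):
            j = i + length
            B[i][j] = W[i][j] + min(B[i][t - 1] + B[t][j] for t in range(i + 1, j + 1))
    return B[0][n]
```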

2.6 Pyramidal Tours for the Traveling Salesman Problem

In the traveling salesman problem one is given a set of n “cities” \(V =\{ 1,\ldots,n\}\) as well as a distance matrix d[i, j]. Find a permutation t (“the tour”), such that

$$f(t) = d(t(n),t(1)) +\displaystyle\sum _{ i=1}^{n-1}d(t(i),t(i + 1))$$

is minimized. This problem is well studied and it is known to be \(\mathcal{N}\mathcal{P}\)-hard. (A good resource on the Traveling Salesman Problem is the “guided tour book” by Lawler, Lenstra, Rinnooy Kan, and Shmoys [79].) A tour t is said to be pyramidal if t is of the form \(1,i_{1},i_{2},\ldots,i_{r},n,j_{1},\ldots,j_{n-r-2}\), where \(1< i_{1}<i_{2}<\cdots <i_{r}<n\mbox{ and }j_{1}>j_{2}>\cdots> j_{n-r-2}>1\). A shortest pyramidal tour can be found by dynamic programming in \(\Theta ({n}^{2})\) time. To this end, let H[i, j] be the length of a shortest Hamiltonian path on the nodes \(\{1,\ldots,\max (i,j)\}\) from i to j subject to the condition that the path goes from i to 1 in descending order followed by the rest in ascending order from 1 to j. The reader is encouraged to verify that by the principle of optimality the following recursion holds:

$$H[i,j] = \left \{\begin{array}{ll} H[i,j - 1] + d[j - 1,j]&\mbox{ for }i < j - 1,\\ \min_{k<i}\{H[i,k] + d[k,j]\} &\mbox{ for }i = j - 1,\\ H[i - 1,j] + d[i,i - 1] &\mbox{ for }i > j + 1,\\ \min_{k<j}\{H[k,j] + d[i,k]\} &\mbox{ for }i = j + 1. \end{array} \right.$$
(13)

Then the cost of the shortest pyramidal tour is

$$ \min\{H[n,n - 1] + d[n - 1,n],H[n - 1,n] + d[n,n - 1]\}.$$

A matrix d is Monge if for all i < i′ and j < j′, \(d[i,j] + d[i',j'] \leq d[i',j] + d[i,j']\). If the matrix d is a Monge matrix then it is easy to see that there exists an optimal tour, which is pyramidal.

3 Open-ended Dynamic Programming: Work Functions

Dynamic programs are useful in decision making for problems where new requests are constantly added, and updates need to be performed. The k-server problem, originally given by Manasse, McGeoch, and Sleator [82], is defined as follows: one is given k ≥ 2 mobile servers which reside in a metric space M. A sequence of requests is issued, where each request is specified by a point \(r \in M\). To “satisfy” this request, one of the servers must be moved to r, at a cost equal to the distance from its current location to r. (If a request is to a point that already has a server, the cost is zero.) The goal is to minimize the total service cost. An algorithm \(\mathcal{A}\) for the k-server problem computes a solution which determines which server is moved at each step. Figure 6 gives a very simple example for the 2-server problem, i.e., the server problem for k = 2, in a metric space with only three points. Interestingly, a very similar metric space is powerful enough to model the noted “ski rental problem”; see Karlin, Manasse, Rudolph, and Sleator [68] for details.

Fig. 6 The two points x and y are at distance 1 and a point z is at distance 2 from x and y. The request sequence is xyxyz. The solution, i.e., the positions of the servers for each request, are circled. The solution has a cost of 7, which is not optimal since it is better to move the server at point z for the first and last request

This problem can be solved by dynamic programming, calculating work functions, when the length of the request sequence and all the requests are known in advance. This standard version of the problem is also called the offline version of the k-server problem. Credit for introducing work functions goes to Chrobak and Larmore [42].

Work functions provide information about the optimal cost of serving the past request sequence. For a request sequence \(\varrho\), by \(\omega _{\varrho }(X)\) one denotes the minimum cost of serving \(\varrho\) and ending in configuration X – an unordered k-tuple of points. The function \(\omega _{\varrho }\) is called the work function after request sequence \(\varrho\). The notation \(\omega\) is used to denote any work function \(\omega _{\varrho }\), for some request sequence \(\varrho\). Immediately from the definition of work functions it can be concluded that the optimal cost to service \(\varrho\) is \(\mathit{opt}(\varrho ) =\min\limits_{X}\omega _{\varrho }(X)\).

For given \(\varrho\), the work function \(\omega _{\varrho }\) can be computed using dynamic programming. Initially, \(\omega _{\epsilon }(X) = {S}^{0}X\), the minimum-matching distance between the initial configuration \({S}^{0}\) and X, for each configuration X (\(\epsilon\) is the empty request sequence). For a non-empty request sequence \(\varrho\), if r is the last request in \(\varrho\), write \(\varrho =\sigma r\). Then \(\omega _{\varrho }\) can be computed recursively as \(\omega _{\varrho } =\omega _{\sigma }\wedge r\), where “\(\wedge \)” is the update operator defined as follows:

$$\displaystyle\begin{array}{rcl} \omega \wedge r(X)& \;=\;\;\min_{Y \ni r}\left \{\omega (Y ) + \mbox{ dist}(Y,X)\right \}&\end{array}$$
(14)

Here \(\mbox{ dist}(Y,X)\) denotes the minimum-matching distance between X and Y . Note that \(\left \vert \omega (X) -\omega (Y )\right \vert \leq XY\) for any work function \(\omega\) and any configurations X and Y . This inequality is called the Lipschitz property. A set of configurations \(S =\{ X_{1},X_{2},\ldots \}\) is said to support a work function \(\omega\) if, for any configuration Y , there exists some \(X \in S\) such that \(\omega (Y ) =\omega (X) + \mbox{ dist}(X,Y )\). If \(\omega\) is supported by a finite set (which it usually is), then there is a unique minimal set S which supports \(\omega\), which is called the work function support of \(\omega\). Note the following: If \(r \in X\), then \(\omega \wedge r(X) =\omega (X)\). If \(r\notin X\), let Y be the configuration that contains r and minimizes \(\omega (Y ) + Y X\), and let x be the point in X that is matched to r in the minimum matching between X and Y . Then \(\omega (Y ) + Y X =\omega (Y ) + rx + Y (X - x + r) \geq \omega (X - x + r) + rx\). Thus the update formula Eq. 14 can be rewritten as \(\omega \wedge r(X) =\min\limits_{x\in X}\left \{\omega (X - x + r) + rx\right \}\). Also note that in calculating the functions \(\omega _{\varrho }\) one need only keep track of the values of the function at its support. This is important as the number of configurations in the domain of \(\omega _{\varrho }\) grows as the number of requested points grows. Figure 7 shows how to calculate an optimal solution, the progression of \(\omega _{\varrho }\), and the support for the example in Fig. 6.

Fig. 7 Calculating the optimal solution using dynamic programming for the example in Fig. 6. The work functions \(\omega _{\varrho }\) are shown in the figure as values adjacent to the corresponding pairs of points. Equivalently, the work functions can be written into the traditional dynamic programming table (top). The support is marked by ovals in the figure and is underlined in the table. The minimum value of the last row is 4 – the optimal value
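For a finite metric space, the work-function dynamic program can be sketched as follows; the call at the end assumes that in the example of Fig. 6 the two servers start on the points y and z.

```python
from itertools import combinations, permutations

def matching_dist(X, Y, dist):
    """Minimum-matching distance between two k-point configurations
    (brute force over permutations, fine for small k)."""
    X, Y = list(X), list(Y)
    return min(sum(dist(x, y) for x, y in zip(X, perm)) for perm in permutations(Y))

def offline_opt(points, dist, initial, requests):
    """Work-function DP: maintains omega(X) for every k-point configuration X and
    returns opt(rho) = min_X omega_rho(X)."""
    k = len(initial)
    configs = [frozenset(c) for c in combinations(points, k)]
    omega = {X: matching_dist(initial, X, dist) for X in configs}   # omega_epsilon
    for r in requests:
        new = {}
        for X in configs:
            if r in X:
                new[X] = omega[X]                   # no server needs to move
            else:
                # omega ^ r (X) = min_{x in X} { omega(X - x + r) + dist(r, x) }
                new[X] = min(omega[(X - {x}) | {r}] + dist(r, x) for x in X)
        omega = new
    return min(omega.values())

d = {('x', 'y'): 1, ('x', 'z'): 2, ('y', 'z'): 2}
dist = lambda a, b: 0 if a == b else d[tuple(sorted((a, b)))]
print(offline_opt(['x', 'y', 'z'], dist, ['y', 'z'], list('xyxyz')))   # 4, cf. Fig. 7
```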

In practice, requests might be given one at a time and the algorithm then has to make a decision about which server to move before future requests are known. An algorithm \(\mathcal{A}\) is said to be online if its decisions are made without the knowledge of future requests. It is unlikely that such an algorithm would achieve optimality. Similar to approximation algorithms, the quality of the algorithm is measured by comparing against the offline cost: \(\mathcal{A}\) is C-competitive if the cost incurred by \(\mathcal{A}\) to service each request sequence \(\varrho\) is at most C times the optimal (offline) service cost for \(\varrho\), plus possibly an additive constant independent of \(\varrho\). The competitive ratio of \(\mathcal{A}\) is the smallest C for which \(\mathcal{A}\) is C-competitive. The competitive ratio is frequently used to study the performance of online algorithms for the k-server Problem, as well as other optimization problems. The reader is referred to the book of Borodin and El-Yaniv [30] for a comprehensive discussion of competitive analysis.

It is interesting to note that the work functions \(\omega _{\varrho }\) play a central role in the algorithm with the currently best known competitive ratio for the k-server problem: the Work Function Algorithm (WFA). The WFA chooses its service of the request sequence \(\varrho\) as follows: Suppose that WFA is in configuration S, and that the current work function is \(\omega\). On request r, WFA chooses some \(x \in S\) which minimizes \(xr +\omega \wedge r(S - x + r)\), and moves the server from x to r. If there is more than one choice which minimizes that quantity, the choice is arbitrary. WFA can be seen as a “linear combination” of two greedy strategies. The first one, a short-sighted greedy, minimizes the cost xr in the given step. The second, a retrospective greedy, chooses the optimal configuration after r, that is, the configuration S − x + r that minimizes \(\omega \wedge r(S - x + r)\). The short-sighted greedy strategy is not competitive for any k. The retrospective greedy strategy is not competitive for \(k \geq 3\), and its competitive ratio for k = 2 is at least 3. The reader might verify that the service sequence depicted in Fig. 6 shows the steps of WFA for that example.


Manasse, McGeoch, and Sleator [82] have proved the following:

Theorem 2

No online algorithm for k servers has competitive ratio smaller than k if a metric space has at least k + 1 points.

They also give the k-server conjecture which states that, for each k, there exists an online algorithm for k servers which is k-competitive in any metric space. For k > 2, this conjecture has been settled only in a number of special cases, including trees and spaces with at most k + 2 points. (cf. the work of Chrobak, Karloff, Payne, and Vishwanathan [44], the work of Chrobak and Larmore [41] and the work of Koutsoupias and Papadimitriou [72].) Koutsoupias and Papadimitriou have shown:

Theorem 3

The Work Function Algorithm is (2k − 1)-competitive for k servers in arbitrary metric spaces.

Thus a wide gap remains. Even some simple-looking special cases remain open, for example the 3-server problem on the circle, in the plane, or in 6-point spaces. Chrobak and Larmore [42] (see also [43]) prove:

Theorem 4

The Work Function Algorithm is 2-competitive for k = 2.

Bein, Chrobak, and Larmore [15] show:

Theorem 5

The Work Function Algorithm is 3-competitive for k = 3 if the metric space M is the Manhattan plane.

4 Intricate Dynamic Programming: Block Deletion in Quadratic Time

Sorting problems under various operations have been studied extensively, including work on sorting with prefix reversals, transpositions, and block moves. This section contains an example from this realm and shows that an intricate setup of the dynamic program can substantially speed up a straightforward dynamic programming scheme.

4.1 Preliminaries

Define a permutation of length n to be a list \(x = (x_{1},\ldots,x_{n})\) consisting of the integers \(\{1,\ldots,n\}\) where each number occurs exactly once. For any \(1 \leq i \leq j \leq n\) denote the sublist of x that starts at position i and ends with position j by \(x_{i\ldots j}\). A list y is a subsequence of x if y is obtained from x by deleting any number of elements. For example, (2, 3) is a subsequence of (2, 4, 3, 1), but not a sublist. Since x has no duplicate symbols, a subsequence of x is uniquely characterized by its set of items. By a slight abuse of notation, one identifies a subsequence with the set of its items. Define the closure of a subsequence of x to be the smallest sublist of x which contains it. For example, the closure of the subsequence (2, 3) of (2, 4, 3, 1) is the sublist (2, 4, 3). If A and A′ are subsequences of a list x, say that A and A′ are separated if the closures of A and A′ are disjoint.

A block of a list is a sublist whose items are consecutive integers in increasing order. A block deletion sequence for a subsequence y of x consists of a sequence \(A_{1},\ldots,A_{m}\) of disjoint non-empty subsequences of y such that

  1. for all \(i = 1,\ldots,m\), \(A_{i}\) is a block in \(y -\bigcup _{u=1}^{i-1}A_{u}\), and

  2. \(y -\bigcup _{u=1}^{m}A_{u}\) is a monotone increasing list.

For example, a minimum length block deletion sequence for the list (1, 4, 2, 5, 3) consists of two steps. First delete the block (2), obtaining (1, 4, 5, 3), then delete the block (4, 5), obtaining the sorted list (1, 3). Figure 8 shows another example of a block deletion sequence. A complete block deletion sequence for a subsequence y of x consists of a block deletion sequence \(A_{1},\ldots,A_{m}\) of y such that \(y -\bigcup _{u=1}^{m}A_{u}\) is the empty list.

Fig. 8 A block deletion sequence. There are five steps: \(A_{1},\ldots,A_{5}\)

4.2 A Dynamic Program for Complete Block Deletion

Consider first the complete block deletion problem for all sublists of x, which will be solved in quadratic time by dynamic programming. Once the \(O({n}^{2})\) answers to this problem are obtained, the original block deletion problem can be solved in quadratic time. The following three lemmas are used:

Lemma 1

If \(A_{1},\ldots,A_{m}\) is a block deletion sequence for a sublist y of x, and \(1 \leq u < v \leq m\) , then either A u and A v are separated, or A u is a subsequence of the closure of A v.

Proof

The closure of A u cannot contain any item of A v , since otherwise A u could not be deleted before A v . If all items of A v are before A u or all items of A v are after A u , then A u and A v are separated. If some items of A v are before A u and some items are after A u , then A u is a subsequence of the closure of A v .

Lemma 2

If \(A_{1},\ldots,A_{t},A_{t+1},\ldots,A_{m}\) is a block deletion sequence for a sublist y of x, and A t and A t+1 are separated, then A t and A t+1 may be transposed, i.e., \(A_{1},\ldots,A_{t+1},A_{t},\ldots,A_{m}\) is a block deletion sequence for y.

Proof

For any u, let \(y_{u} = y -\bigcup _{v<u}A_{v}\). By definition, \(A_{t}\) is a block of \(y_{t}\), and \(A_{t+1}\) is a block of \(y_{t+1} = y_{t} - A_{t}\). Since \(A_{t}\) and \(A_{t+1}\) are separated, \(A_{t+1}\) is also a block of \(y_{t}\). Thus, \(A_{t+1}\) can be deleted before \(A_{t}\).

Lemma 3

For any \(1 \leq i \leq j \leq n\) , if there is a complete block deletion sequence for \(x_{i\ldots j}\) of length m, then there is a complete block deletion sequence for \(x_{i\ldots j}\) of length m such that \(x_{i}\) is deleted in the last step.

Proof

Let \(A_{1},\ldots,A_{m}\) be a complete block deletion sequence of \(x_{i\ldots j}\). Suppose that \(x_{i} \in A_{t}\) for some t < m. Since \(x_{i}\) is deleted in the t th move of the sequence, all deletions after that must involve blocks whose first symbol occurs to the right of \(x_{i}\) in x. That is, for any v > t, \(x_{i}\) cannot be an item of the closure of \(A_{v}\), hence, by Lemma 1, \(A_{t}\) and \(A_{v}\) must be separated. By Lemma 2, one can transpose \(A_{t}\) with \(A_{v}\) for each v > t in turn, moving \(A_{t}\) to the end of the deletion sequence.

Next, a recurrence is given to compute the minimum length of a complete block deletion sequence. This recurrence will be used in Algorithm 1.

Theorem 6

Given a permutation x, let \(t_{i,j}\) denote the minimum length of any complete block deletion sequence for the sublist \(x_{i\ldots j}\) . The values of \(t_{i,j}\) can be computed inductively by the following: Set \(t_{i,i} = 1\), \(t_{i,j} = 0\) for i > j, and for i < j set

$$ t_{i,j}= \left\{\begin{array}{ll} \min\{1+t_{i+1,j},\ t_{i+1,\ell-1} +t_{\ell,j}\} & \mbox{ if there is an }\ell\in \{ i +1,\ldots,j \} \mbox{ such that }x_{\ell} = x_{i}+1, \\ 1 +t_{i+1,j} & \mbox{ otherwise}. \end{array} \right. $$
(15)

Proof

The proof is by induction on j − i to show that \(t_{i,j}\) is computed correctly. The base case is trivial, namely \(t_{i,i} = 1\), because it takes one block deletion to delete a sublist of length 1.

For the inductive step, assume that \(t_{i,j}\) is the length of a minimum block deletion sequence for \(x_{i\ldots j}\) when j − i < k. For j − i = k, let \(m = t_{i,j}\) and let \(A_{1},\ldots,A_{m}\) be a corresponding minimum length complete block deletion sequence of \(x_{i\ldots j}\). By Lemma 3, one can insist that \(x_{i} = a\) is an item in \(A_{m}\). Consider now two cases based on whether there is an \(\ell\) with \(i <\ell\leq j\) such that \(x_{\ell} = a + 1\) (The reader might also consult Fig. 9).

Fig. 9 The recurrence for the \(t_{i,j}\) values

If the element a + 1 does not occur in the interval \(\{i + 1,\ldots,j\}\), then the element a is not part of a block in this interval and must be deleted by itself. So, \(A_{1},\ldots,A_{m-1}\) is a complete block deletion sequence of \(x_{i+1\ldots j}\). Note that it must be of optimum length, for if \(B_{1},\ldots,B_{r}\) were a shorter complete block deletion sequence of \(x_{i+1\ldots j}\), then \(B_{1},\ldots,B_{r},\{a\}\) would be shorter than \(A_{1},\ldots,A_{m}\). Thus, as \(j - (i + 1) = k - 1\), by the induction hypothesis \(t_{i+1,j} = m - 1\). So \(t_{i,j} = 1 + t_{i+1,j}\), which is optimum.

Now for the case that the element \(x_{\ell} = a + 1\) does occur in the interval \(\{i + 1,\ldots,j\}\). This means that a + 1 and possibly other larger values can be included in \(A_{m}\) when element a is deleted. If element a + 1 is not included in \(A_{m}\) then the same argument used in the previous paragraph shows that \(m = 1 + t_{i+1,j}\). If on the other hand a + 1 is included in \(A_{m}\), then by Lemma 1, for any \(t,(1 \leq t \leq m)\), \(A_{t}\) is either completely before or completely after the element a + 1, since \(A_{m}\) is deleted after \(A_{t}\). By Lemma 2, one can permute the indices so that, for some u < m,

  1. if \(t \leq u\), then \(A_{t}\) is a subsequence of \(x_{i+1\ldots \ell-1}\), and

  2. if \(u < t \leq m\), then \(A_{t}\) is a subsequence of \(x_{\ell+1\ldots j}\).

Consequently, \(A_{1},\ldots,A_{u}\) is a block deletion sequence for \(x_{i+1\ldots \ell-1}\) and \(A_{u+1},\ldots,A_{m} -\{ a\}\) is a block deletion sequence for \(x_{\ell+1\ldots j}\). Both of these block deletion sequences must be optimum for their respective intervals. That is, for example, if \(B_{1},\ldots,B_{r}\) were a shorter complete block deletion sequence for \(x_{i+1\ldots \ell-1}\), then \(B_{1},\ldots B_{r},A_{u+1},\ldots A_{m}\) would be a complete block deletion sequence for \(x_{i\ldots j}\), contradicting the minimality of m. By the induction hypothesis, as \(\ell-1 - (i + 1) < j - i\) and \(j-\ell < j - i\), it follows that \(t_{i+1,\ell-1} = u\) and \(t_{\ell,j} = m - u\), so \(t_{i,j} = t_{i+1,\ell-1} + t_{\ell,j} = u + (m - u) = m\).

The resulting dynamic programming algorithm BlockDeletion, which is derived from the recurrence of Theorem 6, is shown below.
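A sketch of how the recurrence of Theorem 6 can be implemented (using the inverse permutation z that also appears in the run-time analysis below) is the following.

```python
def block_deletion_table(x):
    """Recurrence of Eq. 15: t[i][j] is the minimum number of block deletions
    that delete the sublist x_{i..j} completely.  x is a permutation of 1..n
    given as a 1-indexed list (x[0] is unused)."""
    n = len(x) - 1
    z = [0] * (n + 2)                     # inverse permutation: z[v] = position of value v
    for i in range(1, n + 1):
        z[x[i]] = i
    t = [[0] * (n + 2) for _ in range(n + 2)]      # t[i][j] = 0 for i > j
    for i in range(1, n + 1):
        t[i][i] = 1
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            t[i][j] = 1 + t[i + 1][j]                        # x_i is deleted by itself
            if x[i] + 1 <= n:
                l = z[x[i] + 1]
                if i < l <= j:                               # x_l = x_i + 1 lies in the sublist
                    t[i][j] = min(t[i][j], t[i + 1][l - 1] + t[l][j])
    return t
```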

Next the analysis of the run time of algorithm BlockDeletion is considered. Let \(z = (z_{1},\ldots,z_{n})\) be the inverse permutation of x, i.e., \(x_{i} = k\) if and only if \(z_{k} = i\). Note that z can be computed in O(n) preprocessing time.

Theorem 7

Algorithm BlockDeletion has run time \(O({n}^{2})\).

Proof

To prove the theorem one shows that if \(t_{u,v}\) are already known for all \(i < u \leq v \leq j\), then \(t_{i,j}\) can be computed in O(1) time. Let \(m = t_{i,j}\) for i < j and let \(A_{t}\) be the subsequence of \(x_{i\ldots j}\) that is deleted at step t of the complete block deletion sequence of \(x_{i\ldots j}\) of length m. By Lemma 3, assume that \(x_{i} \in A_{m}\). If \(\vert A_{m}\vert> 1\), the index \(\ell= z_{x_{i}+1}\), i.e., \(x_{\ell} = 1 + x_{i}\), can be found in O(1) time, since one has already spent O(n) preprocessing time to compute the array z. The recurrence thus takes O(1) time to execute for each i, j.

Note that to obtain the actual sequence of steps in the optimum complete block deletion sequence one can store appropriate pointers as the \(t_{i,j}\) are computed.

4.3 Computing Block Deletion

The reader is reminded that for the block deletion problem one needs to find the minimum length sequence of block deletions to transform a permutation into a monotone increasing list. Next it is shown how one obtains a solution to the block deletion problem for x in \(O({n}^{2})\) time given that all \(t_{i,j}\) are known for \(1 \leq i \leq j \leq n\). Define a weighted acyclic directed graph G with one node for each \(i \in \left \{0,\ldots,n + 1\right \}\) (with the conventions \(x_{0} = 0\) and \(x_{n+1} = n + 1\)). There is an edge from i to j if and only if i < j and \(x_{i} < x_{j}\), and the weight of that edge is \(t_{i+1,j-1}\). Simply observe that there is a path \(\langle 0,i_{1},i_{2},\ldots,i_{k},n + 1\rangle\) in G exactly when \(x_{i_{1}},\ldots,x_{i_{k}}\) is a monotone increasing list. Furthermore the weight of the edge \(\langle i_{\ell},i_{\ell+1}\rangle\) gives the minimum number of block deletions necessary to delete the elements between positions \(i_{\ell}\) and \(i_{\ell+1}\) in x. Thus if there is a block deletion sequence of x of length m, there must be a path from 0 to n + 1 in G of weight m, and vice versa.

Using standard dynamic programming, a minimum weight path from 0 to n + 1 can be found in \(O({n}^{2})\) time. Let \(0 = i_{0} < i_{1} < \cdots < i_{\ell} = n + 1\) be such a minimum weight path, and let \(w =\sum _{ u=1}^{\ell}t_{i_{u-1}+1,i_{u}-1}\) be the weight of that minimum path. Since every deletion is a block deletion, the entire list can be reduced to a monotone increasing list with w block deletions. Thus it follows:
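A sketch of this reduction, with the sentinel nodes 0 and n + 1 carrying the values −∞ and +∞, follows; it reuses the table of the previous sketch.

```python
def block_deletion_number(x):
    """Minimum number of block deletions turning the permutation x
    (1-indexed list, x[0] unused) into a monotone increasing list."""
    n = len(x) - 1
    t = block_deletion_table(x)
    INF = float('inf')
    val = [-INF] + x[1:] + [INF]          # values of nodes 0, 1, ..., n, n+1
    dist = [INF] * (n + 2)
    dist[0] = 0
    for j in range(1, n + 2):
        for i in range(j):
            if val[i] < val[j]:
                # deleting everything strictly between positions i and j costs
                # t[i+1][j-1] block deletions (0 if the gap is empty)
                dist[j] = min(dist[j], dist[i] + t[i + 1][j - 1])
    return dist[n + 1]

print(block_deletion_number([0, 1, 4, 2, 5, 3]))   # 2 for the list (1, 4, 2, 5, 3)
```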

Theorem 8

The block deletion problem can be solved in time \(O({n}^{2})\).

5 Total Monotonicity and Batch Scheduling

Although dynamic programming has been around for decades, there are more recent techniques for making it more efficient. This section describes an example from scheduling where such a technique is applied. First, the simple traditional dynamic program for the scheduling problem is given, and then the dynamic programming speedup technique is described.

5.1 The Problem \(1\vert \mbox{ s-batch}\vert \sum w_{i}C_{i}\)

Consider the batching problem where a set of jobs \(\mathcal{J} =\{ J_{i}\}\) with processing times p i > 0 and weights \(w_{i} \geq 0\), \(i = 1,\ldots,n\), must be scheduled on a single machine, and where \(\mathcal{J}\) must be partitioned into batches \(\mathcal{B}_{1},\ldots,\mathcal{B}_{r}\). All jobs in the same batch are run jointly and each job’s completion time is defined to be the completion time of its batch. One assumes that when a batch is scheduled it requires a setup time s = 1. The goal is to find a schedule that minimizes the sum of completion times \(\sum w_{i}C_{i}\), where C i denotes the completion time of J i in a given schedule. Given a sequence of jobs, a batching algorithm must assign every job J i to a batch. More formally, a feasible solution is an assignment of each job J i to the \(m_{i}^{th}\) batch, \(i \in \{ 1,\ldots,n\}\) (Fig. 10).

Fig. 10 A batching example. Shown are two feasible schedules for a 5-job problem where processing times are \(p_{1} = 3,p_{2} = 1,p_{3} = 4,p_{4} = 2,p_{5} = 1\) and the weights are \(w_{1} = w_{4} = w_{5} = 1\) and \(w_{2} = w_{3} = 2\). The encircled values give the sum of weighted completion times of the depicted schedules

The problem considered has the jobs executed sequentially; thus the problem is more precisely referred to as the s-batch problem. There is a different version of the problem, not studied here, where the jobs of a batch are executed in parallel, known as the p-batch problem. In that case, the length of a batch is the maximum of the processing times of its jobs. The s-batch problem is also denoted in \(\alpha \vert \beta \vert \gamma\) notation as the \(1\vert \mbox{ s-batch}\vert \sum w_{i}C_{i}\) problem. Albers and Brucker [6] showed that the \(1\vert \mbox{ s-batch}\vert \sum w_{i}C_{i}\) problem is \(\mathcal{N}\mathcal{P}\)-hard in the strong sense by giving a reduction from 3-PARTITION.

There is a large body of work on dynamic programming and batching; see the work of Baptiste [12], Baptiste and Jouglet [13], Brucker, Gladky, Hoogeveen, Kovalyov, Potts, Tautenhahn, and van de Velde [36], Brucker, Kovalyov, Shafransky, and Werner [37], and Hoogeveen and Vestjens [64], as well as the scheduling textbook by Peter Brucker [33]. Batching has wide application in manufacturing (see e.g., [34, 85, 98]), decision management (see, e.g., [74]), and scheduling in information technology (see e.g., [46]). More recent work on online batching is related to the TCP (Transmission Control Protocol) acknowledgment problem (see [20, 49, 67]).

5.2 List Batching

A much easier version is the list version of the problem, where the order of the jobs is given, i.e., \(m_{i} \leq m_{j}\) if i < j. An example is given in Fig. 11.

Fig. 11 List batching. Shown are three schedules for a 5-job problem where all weights are 1 and the processing requirements are \(p_{1} = 3\), \(p_{2} = 1\), \(p_{3} = 4\), \(p_{4} = 2\), \(p_{5} = 1\). The encircled values give the sum of weighted completion times of the depicted schedules

Assume that the jobs are \(1,\ldots,n\) and are given in this order. One can then reduce the list batching problem to a shortest path problem in the following manner: Construct a weighted directed acyclic graph G with nodes \(i = 1,\ldots,n\) (i.e.,  one node for each job) and add a dummy node 0. There is an edge (i, j) if and only if i < j. (See Fig. 12 for a schematic.) Let edge costs c i, j for i < j be defined as

$$c_{i,j} = \left (\displaystyle\sum _{\ell=i+1}^{n}w_{\ell}\right )\left (s +\displaystyle\sum _{ \ell =i+1}^{j}p_{\ell}\right ).$$
(16)

It is easily seen (see [6] for details) that the cost of the path \(< 0,i_{1},i_{2},\ldots,i_{k},n>\) gives the \(\sum C_{i}w_{i}\) value of the schedule whose batches end at jobs \(i_{1},i_{2},\ldots,i_{k}\) and n. Conversely, any batching with cost A corresponds to a path in G with path length A.

Fig. 12 Reduction of the list batching problem to a path problem

A shortest path can be computed in time \(O({n}^{2})\) using the following dynamic program:

Let

$$E[\ell] = \mbox{ cost of the shortest path from }0\mbox{ to }\ell,$$

then

$$E[\ell] =\min\limits_{0\leq k<\ell}\{E[k] + c_{k,\ell}\}\mbox{ with }E[0] = 0,$$
(17)

which results in a table, in which elements can be computed row by row (see Fig. 13).

Fig. 13 Dynamic programming tableau

In other words, the dynamic program computes the row minima of the \(n \times n\) matrix E, where

$$E[\ell,k] = \left \{\begin{array}{ll} E[k] + c[k,\ell]&\mbox{ if }k <\ell \\ \infty &\mbox{ else } \end{array} \right.$$
(18)

with \(\ell= 1,\ldots n\) and \(k = 0,\ldots,n - 1\).
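A direct implementation of Eqs. 16 and 17 runs in \(O({n}^{2})\) time; a sketch, with the setup time s as a parameter, is given here.

```python
def list_batching(p, w, s=1.0):
    """Quadratic DP of Eq. 17 for list batching with a fixed job order;
    p and w are the lists of processing times and weights."""
    n = len(p)
    P = [0.0] * (n + 1)                   # prefix sums of processing times
    W = [0.0] * (n + 1)                   # prefix sums of weights
    for i in range(1, n + 1):
        P[i] = P[i - 1] + p[i - 1]
        W[i] = W[i - 1] + w[i - 1]

    def c(i, j):
        # cost of edge (i, j): batch {i+1,...,j}, Eq. 16
        return (W[n] - W[i]) * (s + P[j] - P[i])

    E = [float('inf')] * (n + 1)
    E[0] = 0.0
    for l in range(1, n + 1):
        E[l] = min(E[k] + c(k, l) for k in range(l))
    return E[n]
```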

As it turns out, it is not necessary to calculate all entries of E to obtain the optimal solution. Surprisingly, only O(n) entries have to be looked up throughout the entire calculation. The reason this is possible is that E is a matrix with special properties, discussed next.

5.3 The Monge Property and Total Monotonicity

Definition 1

A matrix A is Monge if for all i < i′ and j < j′,

$$A[i,j] + A[i',j'] \leq A[i',j] + A[i,j'].$$
(19)

Definition 2

A \(2 \times 2\) matrix is monotone if the rightmost minimum of the upper row is not to the right of the rightmost minimum of the lower row. More formally, \(\left [\begin{array}{ll} a&b\\ c &d \end{array} \right ]\) is monotone if \(b \leq a\) implies that \(d \leq c\).

Definition 3

A matrix A is called totally monotone if all \(2 \times 2\) dimensional submatrices are monotone.

Observation 1

Every Monge matrix is totally monotone.

The reader is referred to Fig. 14. Monge matrices occur routinely. For example, in the batching-to-shortest-path reduction, the cost matrix C is a Monge matrix:

Fig. 14 Monge and monotonicity. Top left the Monge property is shown, which prohibits the situation top right. Thus row minima veer to the right (bottom figure)

Lemma 4

The matrix \(C = (c_{i,j})\) defined in ( 16 ) is Monge for all choices of \(p_{i},w_{i} \geq 0\) . Furthermore values can be queried in O(1) time after linear preprocessing.

Proof

Let \(W_{i} =\sum _{ \nu =1}^{i}w_{\nu }\) and \(P_{i} =\sum _{ \nu =1}^{i}p_{\nu }\) be the partial sums of the \(w_{i}\) and \(p_{i}\) values, respectively. Then

$$c[i,j] = c_{i,j} = (W_{n} - W_{i})(s + P_{j} - P_{i})$$

For i < i′ and j < j′

$$\displaystyle\begin{array}{rcl} c[i',j] + c[i,j']& -& c[i,j] - c[i',j']\end{array}$$
(20)
$$\displaystyle\begin{array}{rcl} & =& (P_{j'} - P_{j})(W_{i'} - W_{i})\end{array}$$
(21)
$$\displaystyle\begin{array}{rcl} & \geq & 0.\end{array}$$
(22)

Also, notice that after linear preprocessing to set up the arrays of partial sums \(W_{i}\) and \(P_{i}\), these values can be queried in O(1) time.

Furthermore, the matrix E is also a Monge matrix:

Lemma 5

The matrix \(E = (E_{\ell,k})\) defined in ( 18 ) is Monge.

Proof

The Monge property is preserved under adding row or column constants and under taking minima; in Eq. 18 the column constant E[k] is added to the (transposed) Monge matrix C.

Computing the row minima of a Monge (or totally monotone) matrix can be done with a simple divide-and-conquer algorithm in \(O(n\log n)\) time, cf. Fig. 15. A more intricate recursive procedure by Aggarwal, Klawe, Moran, Shor, and Wilber [3], known as the SMAWK algorithm (the name was derived using the initials of the authors of the original paper in which the algorithm was introduced), has linear run time.

Fig. 15 Calculating all row minima of a Monge matrix. The algorithm finds the minimum of the middle row in linear time and then recurses on the upper left and lower right parts of the matrix
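A sketch of this divide-and-conquer procedure (O(n log n) for an explicitly given n × n totally monotone matrix) follows.

```python
def monotone_row_minima(M):
    """Row-minimum columns of an explicitly given totally monotone matrix M
    (a list of equal-length rows), by the recursion of Fig. 15."""
    n = len(M)
    argmin = [0] * n

    def solve(r0, r1, c0, c1):
        # the minima of rows r0..r1 lie in columns c0..c1
        if r0 > r1:
            return
        mid = (r0 + r1) // 2
        best = min(range(c0, c1 + 1), key=lambda c: M[mid][c])
        argmin[mid] = best
        solve(r0, mid - 1, c0, best)      # minima of earlier rows are not to the right
        solve(mid + 1, r1, best, c1)      # minima of later rows are not to the left

    if n:
        solve(0, n - 1, 0, len(M[0]) - 1)
    return argmin
```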

Note that the dynamic program of Fig. 13 is essentially used to calculate the row minima of E – a Monge matrix. However, the simple \(O(n\log n)\) algorithm cannot be used here since it requires all elements of E to be available offline, i.e., before the computation begins. Instead, there is a protocol by which each element can be queried: Once the minimum of row 1 is known, any element in column 1 can be generated in constant time; once the minimum of row 2 is known, any element in columns 1 and 2 is “knowable”; and so forth. (See also Fig. 16.)

Fig. 16 The “online” protocol of the tableau. Note that once the minimum of row 4 is known, column 4 is “knowable”

More formally the protocol is as follows:

  1. For each row index \(\ell\) of E, there is a column index \(\gamma _{\ell}\) such that for \(k>\gamma _{\ell}\), \(E_{\ell,k} = \infty \). Furthermore, \(\gamma _{\ell} \leq \gamma _{\ell+1}\).

  2. If \(k \leq \gamma _{\ell}\), then \(E[\ell,k]\) can be evaluated in O(1) time provided that the row minima of the first \(\ell\) rows are already known.

  3. E is a totally monotone matrix.

Larmore and Schieber [77] have developed an algorithm that generalizes SMAWK to run in linear time in this case as well. Their algorithm is known as the LARSCH algorithm – again, as with the SMAWK algorithm, the name was derived using the initials of the authors of the original paper, Larmore and Schieber [77]. Both SMAWK and LARSCH are important in dynamic programming speedup and are explained in the next section.

6 The SMAWK and LARSCH Algorithms

This section gives “ready to implement” descriptions of the SMAWK and LARSCH algorithms mentioned in the previous section.

6.1 The Matrix Searching Problem

The matrix searching problem is the problem of finding all row minima of a given matrix M. If n and m are the numbers of rows and columns, respectively, of M, the problem clearly takes \(\Theta (nm)\) time in the worst case. However, if M is totally monotone the problem can be solved in O(n + m) time using the SMAWK algorithm [3].

SMAWK is recursive, but uses two kinds of recursion, called INTERPOLATE and REDUCE, which are described in the following. Let \(1 \leq J(i) \leq m\) be the index of the minimum element of the i th row. Since M is totally monotone, \(J(i) \leq J(k)\) for any i < k.

  1. For small cases, SMAWK uses a trivial algorithm. If m = 1, the problem is trivial. If n = 1, simply use linear search.

  2. If \(n \geq 2\), then INTERPOLATE can be used. Simply let M′ be the \(\left \lfloor n/2\right \rfloor \times m\) submatrix of M consisting of all even indexed rows of M. Recursively, find all row minima of M′, and then use linear search to find the remaining minima of M in O(n + m) time.

  3. If m > n, then REDUCE can be used. The set \(\left \{J(i)\right \}\) clearly has cardinality at most n. REDUCE selects a subset of the columns of M which has cardinality at most n, and which includes Column J(i) for all \(1 \leq i \leq n\). Let M′ be the submatrix of M consisting of all the columns selected by REDUCE. Then, recursively, find all row minima of M′. These will exactly be the row minima of M. The time required to select the columns of M′ is O(n + m).

SMAWK then operates by alternating the two kinds of recursion. If the initial matrix is \(n \times n\), then INTERPOLATE reduces to the problem on a matrix of size roughly \(n/2 \times n\), and then REDUCE reduces to the problem on a matrix of size roughly \(n/2 \times n/2\), and so forth.

Time Complexity

Let T(n) be the time required by SMAWK to find all row minima of an \(n \times n\) totally monotone matrix. Applying both INTERPOLATE and REDUCE, obtain the recurrence

$$T(n) = O(n) + T(n/2)$$

and thus T(n) = O(n). By a slight generalization of the recurrence, one can show that SMAWK solves the problem for an \(n \times m\) matrix in O(n + m) time.

INTERPOLATE is explained using the code below.

Code for INTERPOLATE
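A minimal sketch of this step in Python (an illustration only, not the original listing): it assumes rows are numbered 1..n and columns 1..m, that entries are reached through a hypothetical accessor lookup(i, j), and that even_min maps every even-numbered row to the column of its minimum, as returned by the recursive call.

def interpolate(n, m, lookup, even_min):
    # Row minima of the odd-numbered rows of an n x m totally monotone
    # matrix, given the column even_min[i] of the minimum of each
    # even-numbered row i; lookup(i, j) returns M[i, j].
    J = dict(even_min)
    for i in range(1, n + 1, 2):               # odd-numbered rows
        lo = J[i - 1] if i - 1 >= 1 else 1     # monotonicity of J gives
        hi = J[i + 1] if i + 1 <= n else m     # J(i-1) <= J(i) <= J(i+1)
        best = lo
        for j in range(lo, hi + 1):
            if lookup(i, j) < lookup(i, best):
                best = j
        J[i] = best
    return J

Because the search intervals of consecutive odd-numbered rows overlap in at most one column, the total work of all these linear searches is linear in the size of the matrix.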

The total number of comparisons needed to find the minima of all odd rows is at most n − 1, as illustrated in Fig. 17 below.

Fig. 17

INTERPOLATE. M′ consists of the hatched rows, and the minima of those rows are indicated by black dots. By the monotonicity property of J, only search the interval indicated in black to find the minimum of any odd numbered row

Now for REDUCE. The procedure REDUCE operates by maintaining a stack, where each item on the stack is a column (actually, a column index). Initially the stack is empty, and columns are pushed onto the stack one at a time, starting with Column 1. The capacity of the stack is n, the number of rows.

But before any given column is pushed onto the stack, any number (zero or more) of earlier columns are popped off the stack. A column is popped if it is determined that it is dominated, which implies that none of its entries can be the minimum of any row. In some cases, if the stack is full, a new column will not be pushed.

At the end, those columns which remain on the stack form the matrix M′.

Dominance

There is a loop invariant, namely that if the stack size is at least i, and if S[i] = j, then all entries of Column j above Row i are known to be useless, i.e., will not be the minima of any row. Formally, \(M[i',S[i - 1]] \leq M[i',j]\) for all \(1 \leq i' < i\).

Suppose that Column k is the next column to be possibly pushed onto the stack. If the stack is empty, then \(S[1] \leftarrow k\), and one is done. Otherwise, it must be checked to see whether the top column in the stack is dominated. Let i be the current size of the stack, and let S[i] = j. Then, necessarily, \(i \leq j < k\). Column j is dominated if M[i, k] < M[i, j]. The reason is that \(M[i',j] \geq M[i',S[i - 1]]\) for any i′ < i by the loop invariant, and that \(M[i',k] < M[i',j]\) for all \(i' \geq i\) by monotonicity.

If the stack is popped, then the new top is tested for being dominated. This continues until the test fails. If the stack size is currently less than n, k is pushed onto the stack; otherwise, Column k is useless and is discarded.

Code for REDUCE
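The stack discipline just described, again as a Python sketch under the same assumptions (rows 1..n, columns 1..m, hypothetical accessor lookup(i, j)); the returned list holds the surviving column indices in left-to-right order, and M′ is formed from exactly these columns.

def reduce_columns(n, m, lookup):
    # REDUCE: keep at most n columns of an n x m totally monotone matrix
    # that together contain all row minima.
    stack = []                         # stack[i-1] plays the role of S[i]
    for k in range(1, m + 1):          # candidate Column k
        while stack:
            i = len(stack)             # test the top column S[i] at Row i
            if lookup(i, k) < lookup(i, stack[-1]):
                stack.pop()            # S[i] is dominated by Column k
            else:
                break
        if len(stack) < n:
            stack.append(k)            # otherwise Column k is discarded
    return stack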

The total number of comparisons needed to execute REDUCE (other than the recursive call) is at most 2m. One can see this using an amortization argument. Each column is given two credits initially. When a column is pushed onto the stack, it spends one credit, and when it is popped off the stack it spends one credit. Each of those two events requires at most one comparison. Thus, the total number of comparisons is at most 2m.

Figure 18 shows an example of an \(8 \times 16\) matrix M, where the shaded columns survive to form M′.

Fig. 18

REDUCE. M′ consists of the hatched columns. All row minima, indicated by black dots, lie within those columns, although possibly not all columns contain minima

6.2 The Online Matrix Searching Problem

Define an almost lower triangular matrix to be a matrix M of n rows, numbered \(1\ldots n\), and m columns, such that the row lengths are nondecreasing in the row index. More formally, let the columns be numbered \(0\ldots m - 1\). Then there must exist n row lengths, \(1 \leq \ell [1] \leq \ell [2] \leq \cdots \leq \ell [n] = m\), such that M[i, j] is defined if and only if \(1 \leq i \leq n\) and \(0 \leq j <\ell [i]\). If \(\ell[i] = i\) for all i, then M is a standard lower triangular matrix.

Given an almost lower triangular matrix, let J(i) be the column index of the minimum entry in the i th row of M. (Ties are broken arbitrarily.) The online matrix search problem is the problem of finding all row minima of M, with the condition that a given entry M[i, j] is available if and only if J(k) has been computed for every k such that \(\ell[k] \leq j\). For example, if M is a standard lower triangular matrix, M[i, j] is available if and only if J(k) has been computed for all \(k \leq j\). Note that M[i, 0] is available initially.

Assume that the computation is done by a process B which is in communication with a supervisor A. The online computation then proceeds as follows.

  • A grants permission for B to see Column 0.

  • B reports J[1] (which is certainly 0) to A.

  • A grants permission for B to see Column 1.

  • B reports J[2] to A.

  • A grants permission for B to see Column 2.

  • B reports J[3] to A.

  • etc.

Note that B is unable to compute J(i) until it has seen Columns 0 through i − 1, since the minimum of Row i could be in any of those columns. Conversely, one must have computed J(1) through J(i − 1) in order to be allowed to see those columns. Thus, the order of the events in the above computation is strict.
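The protocol can be made concrete with a small Python sketch for the standard lower triangular case. It is deliberately simple-minded, plays the roles of both A and B in a single loop, and uses a hypothetical accessor entry(i, j); rows are numbered 1..n and columns 0..n − 1.

def online_row_minima_naive(n, entry):
    # Online row minima of a standard lower triangular matrix
    # (row i has columns 0..i-1).  When J(i) is computed, only
    # columns 0..i-1 have been released, which is exactly what row i needs.
    J = {}
    for i in range(1, n + 1):
        best = 0                              # Column 0 is available initially
        for j in range(1, i):                 # Columns 1..i-1 were released
            if entry(i, j) < entry(i, best):  # after J(1),...,J(i-1)
                best = j
        J[i] = best                           # report J(i) to the supervisor
    return J

This takes time proportional to the number of defined entries; the point of LARSCH, described next, is to work within the same protocol in only O(n + m) time.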

6.3 Algorithm LARSCH

In general, it takes O(nm) time to solve the online matrix search problem. However, if M is totally monotone, the values of J are monotone increasing, and the online matrix search problem can be solved in O(n + m) time by using an online algorithm LARSCH, which is the online version of SMAWK, with some adaptations.

Let M be the input matrix. LARSCH works using a chain of processes, each of which is in communication with its supervisor, the process above it in the chain. The supervisor of the top process is presumably an application program. Refer to the process below a process as its child.

Each process in the chain works on an instance of the online matrix searching problem. The top process works on the instance given by the application. Each process reports results interleaved with messages received from its supervisor.

Each process P works on the problem for a totally monotone almost lower triangular matrix M P , which is a submatrix of M. If Q is the child of P, then P creates a submatrix M Q of M P and passes that submatrix to Q. Of course, P never passes the entries of the matrix to Q; rather, it tells Q which entries of M it is allowed to see. For clarity assume that each process uses its own local row and column indices, but is aware of the global indices of each entry. For example, in the example calculation in Table 5, M 5[3, 2] = M[15, 11].

Table 5 LARSCH algorithm example: INTERPOLATE and REDUCE

Alternating Types.

In the following detailed description of LARSCH, assume that the initial matrix M is standard lower triangular. (The algorithm can easily be generalized to cover other cases.)

Let \(P_{0},P_{1},\ldots P_{h}\), \(h = 2\left (\left \lfloor \log _{2}(n + 1)\right \rfloor - 1\right )\), be the processes, where P t + 1 is the child of P t . The processes are of alternating types. If t is even, say that P t has standard type, while if t is odd, say that P t has stretched type.

Let M t be the submatrix of M visible to P t . M 0 = M. If t is odd, then M t has \(n_{t} = \left \lfloor {2}^{-t/2}\left (n - {2}^{t/2} - 1\right )\right \rfloor \) rows and \(m_{t} = 2n_{t}\) columns, and the length of its \({i}^{\mathrm{th}}\) row is 2i. If \(t \geq 2\) is even, then M t has \(n_{t} = n_{t-1}\) rows and \(m_{t} = n_{t}\) columns, and the length of its i th row is i.

The choice of M t + 1 as a submatrix of M t is illustrated below. If t is even, then the rows of M t + 1 are rows \(3,5,7,\ldots \left \lfloor \frac{n-1} {2} \right \rfloor \) of M t , i.e., all rows of odd index other than 1. Row i of M t + 1 consists of all but the last entry of Row 2i + 1 of M t .

If t is odd, then M t + 1 has the same number of rows as M t , but only half the columns. Since there are n t rows in M t , the number of possible columns of M t which contain values of J cannot exceed n t . The REDUCE procedure of LARSCH chooses exactly n t columns of M t in a way that makes certain that all columns which contain row minima of M t are chosen. The figure below illustrates one possible choice (Fig. 19). The shaded entries are passed to M t + 1. Note that not every entry of a selected column is part of M t + 1.

Fig. 19

Shaded entries of M t indicate the submatrix M t + 1 passed to the child process. Left: odd t. Right: even t

6.4 Standard Type Process: P t for t Even (INTERPOLATE)

Code for a Process of Standard Type: P t for t Even.

6.5 Stretched Type Process: P t for t Odd (REDUCE)

A stack called the column stack is maintained. Let top be the number of items (which are column indices) stored in the stack, and let S[i] be the ith entry (counting from the bottom) of the stack. That is, S[i] is defined for all \(1 \leq i \leq top\).

Each time process P t is able to view a new column, if the column stack is not empty, it pops all columns which are dominated by the new column off the stack, and then pushes the new column. The domination rule is that Column S[i] is dominated by the new column, say Column j, if \(M_{t}[i,j] < M_{t}[i,S[i]]\). Of course, it is possible that \(j \geq 2i\), in which case M t [i, j] is taken to be infinity.

Code for a Process of Stretched Type: P t for t Odd.

How may a new column possibly dominate existing columns on the column stack? Lines 2, 3, and 4 of the code below are the detail of Line 5 and Line 7 of the code above, while Lines 5 and 6 are the detail of Line 6 and Line 8 of the code above.

Code for Popping and Pushing the Column Stack.

7 The Quadrangle Inequality and Binary Search Trees

Another type of speedup is based on the Knuth-Yao quadrangle inequality. This section discusses this kind of speedup and its relation to the SMAWK/LARSCH speedup.

7.1 Background

Recall the construction of optimal binary search trees discussed in Sect. 2.5. Gilbert and Moore [59] gave an O(n 3) time algorithm. More than a decade later, in 1971, it was noticed by Knuth [71] that, using a complicated amortization argument, the B i, j can all be computed using only O(n 2) time. Around another decade later, in the early 1980s, Yao (see [96, 97]) simplified Knuth’s proof and, in the process, showed that this dynamic programming speedup worked for a large class of problems satisfying a quadrangle inequality property. Many other authors then used the Knuth-Yao technique, either implicitly or explicitly, to speed up different dynamic programming problems. See, for example, the work of Wessner [93], of Atallah, Kosaraju, Larmore, Miller, and Teng [9], and of Bar-Noy and Ladner [14].

Even though both the SMAWK algorithm and the Knuth-Yao (KY) speedup (best described in [71, 96, 97]) use an implicit quadrangle inequality in their associated matrices, on second glance, they seem quite different from each other. In the SMAWK technique, the quadrangle inequality is on the entries of a given \(m \times n\) input matrix, which can be any totally monotone matrix. The KY technique, by contrast, uses a quadrangle inequality in the upper-triangular \(n \times n\) matrix B. That is, it uses the QI property of its result matrix to speed up the evaluation, via dynamic programming, of the entries in the same result matrix. Aggarwal and Park [2] demonstrated a relationship between the KY problem and totally-monotone matrices by building a 3-D monotone matrix based on the KY problem and then using an algorithm due to Wilber [94] to find tube minima in that 3-D matrix. They left as an open question the possibility of using SMAWK directly to solve the KY problem.

Definition 4

A two-dimensional upper triangular matrix A, with entries A(i,j) for \(0 \leq i \leq j \leq n\), satisfies the quadrangle inequality (QI) if for all \(i \leq i' \leq j \leq j'\),

$$A(i,j) + A(i',j') \leq A(i',j) + A(i,j').$$
(23)

Observation 2

A Monge matrix satisfies a quadrangle inequality, but a totally monotone matrix may not.

Yao’s result (of [96]) was formulated as follows: For \(0 \leq i \leq j \leq n\) let w(i, j) be a given function and

$$B_{i,j} = \left \{\begin{array}{ll} 0, &\mbox{ if }i = j;\\ w(i, j) +\min\limits_{ i<t\leq j}\left (B_{i,t-1} + B_{t,j}\right ),&\mbox{ if }i < j. \end{array} \right.$$
(24)

Definition 5

w(i, j) is monotone in the lattice of intervals if \([i,j] \subseteq [i',j']\) implies \(w(i,j) \leq w(i',j')\).

As an example, it is not difficult to see that the \(w(i,j) =\sum _{ l=i+1}^{j}p_{l} +\sum _{ l=i}^{j}q_{l}\) of the BST recurrence (12) satisfies the quadrangle inequality and is monotone in the lattice of intervals.
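This is easy to confirm numerically. The following Python check uses arbitrary nonnegative values for p (indexed 1..n) and q (indexed 0..n) as hypothetical data and verifies both properties of this particular w on all valid index quadruples.

import itertools, random

n = 6
p = [0] + [random.randint(0, 9) for _ in range(n)]    # p[1..n]; p[0] unused
q = [random.randint(0, 9) for _ in range(n + 1)]      # q[0..n]

def w(i, j):
    # w(i, j) = sum_{l=i+1}^{j} p_l + sum_{l=i}^{j} q_l
    return sum(p[i + 1:j + 1]) + sum(q[i:j + 1])

# quadrangle inequality: w(i,j) + w(i',j') <= w(i',j) + w(i,j') for i <= i' <= j <= j'
assert all(w(i, j) + w(i2, j2) <= w(i2, j) + w(i, j2)
           for i, i2, j, j2 in itertools.product(range(n + 1), repeat=4)
           if i <= i2 <= j <= j2)

# monotone on the lattice of intervals: [i,j] contained in [i',j'] implies w(i,j) <= w(i',j')
assert all(w(i, j) <= w(i2, j2)
           for i, i2, j, j2 in itertools.product(range(n + 1), repeat=4)
           if i2 <= i <= j <= j2)

In fact, for this w the quadrangle inequality holds with equality, since both sides sum exactly the same p and q values.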

Definition 6

Let

$$K_{B}(i,j) =\max \{ t\, :\, w(i,j) + B_{i,t-1} + B_{t,j} = B_{i,j}\},$$

i.e., the largest index which achieves the minimum in (24).

Yao then proves two Lemmas (see Fig. 21 for an example):

Lemma 6 (Lemma 2.1 in [96])

If w(i,j) satisfies the quadrangle inequality as defined in Definition 4, and is also monotone on the lattice of intervals, then the B i,j defined in ( 24 ) also satisfy the quadrangle inequality.

Lemma 7 (Lemma 2.2 in [96])

If the function B i,j defined in ( 24 ) satisfies the quadrangle inequality then

$$K_{B}(i,j) \leq K_{B}(i,j + 1) \leq K_{B}(i + 1,j + 1)\qquad \forall \ i < j.$$

Lemma 6 proves that a QI in the w(i, j) implies a QI in the B i, j . Suppose then that one evaluates the values of the B i, j in the order \(d = 1,2,\ldots,n\), where, for each fixed d, one evaluates all of B i, i + d , \(i = 0,1,\ldots,n - d\). Then Lemma 7 says that B i, i + d can be evaluated in time \(O(K_{B}(i + 1,i + d) - K_{B}(i,i + d - 1))\). Note that

$$\displaystyle\sum _{i=0}^{n-d}(K_{ B}(i + 1,i + d) - K_{B}(i,i + d - 1)) \leq n,$$

and thus all entries for fixed d can be calculated in O(n) time. Summing over all d, it follows that all B i, j can be obtained in O(n 2) time.
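A compact Python sketch of the resulting O(n 2) procedure follows. Here w is a user-supplied function assumed to satisfy the QI and to be monotone on the lattice of intervals (for instance the BST weight above), and K[i][j] stores K_B(i,j).

import math

def knuth_yao(w, n):
    # O(n^2) evaluation of recurrence (24): the split point t is searched
    # only in the window K_B(i, j-1) <= t <= K_B(i+1, j) given by Lemma 7.
    B = [[0.0] * (n + 1) for _ in range(n + 1)]
    K = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        K[i][i] = i                      # convention for the empty diagonal
    for d in range(1, n + 1):            # diagonals d = j - i, in order
        for i in range(n - d + 1):
            j = i + d
            lo = max(K[i][j - 1], i + 1)
            hi = min(K[i + 1][j], j)
            best_val, best_t = math.inf, lo
            for t in range(lo, hi + 1):
                v = B[i][t - 1] + B[t][j]
                if v <= best_val:        # '<=' keeps the largest minimizer
                    best_val, best_t = v, t
            B[i][j] = w(i, j) + best_val
            K[i][j] = best_t
    return B, K

For the binary search tree weights, B[0][n] is the optimal expected cost, and the search windows along each fixed diagonal telescope to O(n) total work, which is exactly the amortization above.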

As mentioned, Lemma 7 and the resultant O(n 2) running time have long been viewed as unrelated to the SMAWK algorithm. While they seem somewhat similar (a QI leading to an order of magnitude speedup) they appeared not to be directly connected until very recently. In the next section it is shown how to solve the Knuth-Yao problem directly using decompositions into totally monotone matrices.

7.2 Decomposition Techniques

Definition 7

For \(1 \leq d \leq n\) define the \((n - d + 1) \times (n + 1)\) matrix D d by

$$D_{i,t}^{d} = \left \{\begin{array}{ll} w(i,i + d) + B_{i,t-1} + B_{t,i+d},&\mbox{ if }0 \leq i < t \leq i + d \leq n; \\ \infty &\mbox{ otherwise}. \end{array} \right.$$
(25)

Figure 20 illustrates a first decomposition. Note that (24) immediately implies

$$B_{i,i+d} =\min\limits_{0\leq t\leq n}D_{i,t}^{d}$$
(26)

so finding the row-minima of D d yields B i, i + d , \(i = 0,\ldots,n - d\). Put another way, the B i, j entries on diagonal d = j − i are exactly the row-minima of matrix D d.

Fig. 20

The top figure shows the B i, j matrix for n = 12. Each diagonal, d = j − i, in the matrix will correspond to a totally monotone matrix D d. The minimal item of row i in D d will be the value B i, i + d . The other figure shows D 8

Lemma 8

If w(i,j) and the function B i,j defined in ( 24 ) satisfy the QI then, for each d \((1 \leq d \leq n)\) , D d is a totally monotone matrix.

Proof

It suffices to prove that

$$D_{i,t}^{d} + D_{ i+1,t+1}^{d} \leq D_{ i+1,t}^{d} + D_{ i,t+1}^{d}$$
(27)

Note that if i + 1 < t < i + d, then from Lemma 6,

$$B_{i,t-1} + B_{i+1,t} \leq B_{i+1,t-1} + B_{i,t}$$
(28)

and

$$B_{t,i+d} + B_{t+1,i+1+d} \leq B_{t+1,i+d} + B_{t,i+1+d}.$$
(29)

Thus,

$$\begin{array}{c@{\;\;}l} \;\;&D_{i,t}^{d} + D_{i+1,t+1}^{d} \\ =\;\;&\left [w(i,i + d) + B_{i,t-1} + B_{t,i+d}\right ] + \left [w(i + 1,i + 1 + d) + B_{i+1,t} + B_{t+1,i+1+d}\right ] \\ =\;\;&w(i,i + d) + w(i + 1,i + 1 + d) + \left [B_{i,t-1} + B_{i+1,t}\right ] + \left [B_{t,i+d} + B_{t+1,i+1+d}\right ] \\ \leq \;\;&w(i,i + d) + w(i + 1,i + 1 + d) + \left [B_{i+1,t-1} + B_{i,t}\right ] + \left [B_{t+1,i+d} + B_{t,i+1+d}\right ] \\ =\;\;&\left [w(i + 1,i + 1 + d) + B_{i+1,t-1} + B_{t,i+1+d}\right ] + \left [w(i,i + d) + B_{i,t} + B_{t+1,i+d}\right ] \\ =\;\;&D_{i+1,t}^{d} + D_{i,t+1}^{d} \end{array}$$

and (27) is correct (where it is noted that the right hand side is \(\infty \) if \(i + 1\not<t\) or \(t\not<i + d\)).

Lemma 9

Assuming that all of the row-minima of \({D}^{1},{D}^{2},\ldots,{D}^{d-1}\) have already been calculated, all of the row-minima of D d can be calculated using the SMAWK algorithm in O(n) time.

Proof

From the previous lemma, D d is a totally monotone matrix. Also, by definition, its entries can be calculated in O(1) time, using the previously calculated row-minima of D d′ where d′ < d. Thus SMAWK can be applied.

Combined with (26) this immediately gives a new O(n 2) algorithm for solving the KY problem; just run SMAWK on the D d in the order \(d = 1,2,\ldots,n\) and report all of the row-minima.
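The structure of this algorithm, sketched in Python: for brevity a brute-force scan stands in at the point where SMAWK would be called, so the sketch runs in O(n 3) time, but it evaluates exactly the decomposition of (25) and (26); replacing the inner scan by the SMAWK routine of Sect. 6 gives the O(n 2) bound.

import math

def ky_via_diagonals(w, n):
    # Evaluate recurrence (24) diagonal by diagonal through the matrices
    # D^d of (25); B[i][i+d] is the minimum of row i of D^d, as in (26).
    B = [[0.0] * (n + 1) for _ in range(n + 1)]

    def D(d, i, t):
        if 0 <= i < t <= i + d <= n:
            return w(i, i + d) + B[i][t - 1] + B[t][i + d]
        return math.inf                  # outside the finite region of D^d

    for d in range(1, n + 1):
        # D^d is totally monotone (Lemma 8) and its entries use only row
        # minima of D^1, ..., D^{d-1}, so SMAWK could find all row minima
        # of D^d here in O(n) time.
        for i in range(n - d + 1):
            B[i][i + d] = min(D(d, i, t) for t in range(n + 1))
    return B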

7.3 Online Decomposition

The online problem restricted to the optimal binary search tree would be to construct the OBST for items \(\mbox{ Key}_{L+1},\ldots,\mbox{ Key}_{R}\), and, at each step, add either \(\mbox{ Key}_{R+1}\), a new key to the right, or \(\mbox{ Key}_{L}\), a new key to the left. Every time a new element is added, it is desired to update the B i, j (dynamic programming) table and thereby construct the optimal binary search tree of the new full set of elements. (See Fig. 21.)

Fig. 21

An example of the online case for optimal binary search trees where \((p_{4},p_{5},p_{6},p_{7}) = (2,69,38,84)\) and \((q_{3},q_{4},q_{5},q_{6},q_{7}) = (20,69,31,55,16)\). The left table contains the B i, j values; the right one, the K B (i, j) values. The unshaded entries in the table are for the problem restricted to only keys 5, 6. The dark gray cells are the entries added to the table when key 7 is added to the right. The light gray cells are the entries added when key 4 is added to the left. The corresponding optimal binary search trees are also given, where circles correspond to successful searches and squares to unsuccessful ones. The values in the nodes are the weights of the nodes (not their keys)

KY speedup cannot be used to do this. The reason that the speedup fails is that the KY speedup is actually an amortization over the evaluation of all entries when done in a particular order. In the online case, adding a new item n to previously existing items \(1,2,\ldots,n - 1\) requires using (24) to compute the n new entries B i, n , in the fixed order \(i = n,n - 1,\ldots,1,0\) and it is not difficult to construct an example in which calculating these new entries in this order using (24) requires O(n 2) work.

Neither can the decomposition technique from the previous section solve the online problem. To see why, suppose that items \(1,\ldots,n\) have previously been given, new item n + 1 has just been added, and one needs to calculate the values \(B_{i,n+1}\) for \(i = 0,\ldots,n + 1\). In this formulation this corresponds to adding a new bottom row to every matrix D d , creating a new matrix D n + 1, and finding the row-minima of all of the n new bottom rows. Unfortunately, the SMAWK algorithm only works on all of the rows of a matrix at once and cannot help to find the minimum of a single new row.

Note that for the online problem it is certainly possible to recompute the entire table; however this comes at the price of O(n 2) time, where n = R − L is the number of keys currently in the table, leading to a total running time of O(n 3) to insert all of the keys. Of interest here is the question of whether one can maintain the speedup while inserting the keys in an online fashion. The goal is an algorithm in which a sequence of n online key insertions will result in a worst case O(n) per step to maintain an optimal tree, yielding an overall run time of O(n 2) (Fig. 22).

Fig. 22

The top figure shows the B i, j matrix for n = 12. Each column in the B i, j matrix will correspond to a totally monotone matrix R j. The minimal element of row i in R j will be the value B i, j . The other figure shows R 8

To this end consider two further decompositions: one indexed by the rightmost element seen so far, giving matrices R j (see Fig. 22) with entries \(R_{i,t}^{j} = w(i,j) + B_{i,t-1} + B_{t,j}\) for \(0 \leq i < t \leq j\) and \(\infty \) otherwise; and one indexed by the leftmost element seen so far, giving the matrices L i defined next. See Fig. 23.

Fig. 23

The top figure on the left shows the B i, j matrix for n = 12. Each row in the B i, j matrix will correspond to a totally monotone matrix L i. The minimal element of row j in L i will be the value B i, j . The other figure shows L 8

Definition 8

For \(0 \leq i < n\) define the \((n - i) \times (n - i)\) matrix L i by

$$L_{j,t}^{i} = \left \{\begin{array}{ll} w(i,j) + B_{i,t-1} + B_{t,j},&\mbox{ if }i < t \leq j \leq n; \\ \infty, &\mbox{ otherwise.} \end{array} \right.$$
(30)

(For convenience, set the row and column indices to run from \((i + 1)\ldots n\) and not \(0\ldots (n - i - 1)\).) Note that (24) immediately implies

$$B_{i,j} =\min\limits_{i<t\leq n}L_{j,t}^{i}$$
(31)

so finding the row-minima of L i yields B i, j for \(j = i + 1,\ldots,n\). Put another way, the B i, j entries in row i are exactly the row minima of matrix L i.

Lemma 10

If the function defined in ( 24 ) satisfies the QI then R j (resp. L i ) are totally monotone matrices for each fixed j (resp. i).

Proof

The proofs are very similar to that of Lemma 8. To prove R j is totally monotone, note that if i + 1 < t < j, one can again use (28); writing the entries from (28) in boldface gives

$$\begin{array}{c@{\;\;}l} \;\;&R_{i,t}^{j} + R_{i+1,t+1}^{j} \\ =\;\;&\left [w(i,j) + \mathbf{B_{i,t-1}} + B_{t,j}\right ] + \left [w(i + 1,j) + \mathbf{B_{i+1,t}} + B_{t+1,j}\right ] \\ \leq \;\;&\left [w(i + 1,j) + \mathbf{B_{i+1,t-1}} + B_{t,j}\right ] + \left [w(i,j) + \mathbf{B_{i,t}} + B_{t+1,j}\right ] \\ =\;\;&R_{i+1,t}^{j} + R_{i,t+1}^{j}\end{array}$$

and thus R j is Monge (where the right hand side is \(\infty \) if \(i + 1\not<t\)) and thus totally monotone. To prove L i is totally monotone, if i < t < j then again use (28) (with j replaced by j + 1) to get

$$\begin{array}{c@{\;\;}l} \;\;&L_{j,t}^{i} + L_{j+1,t+1}^{i} \\ =\;\;&\left [w(i,j) + B_{i,t-1} + \mathbf{B_{t,j}}\right ] + \left [w(i,j + 1) + B_{i,t} + \mathbf{B_{t+1,j+1}}\right ] \\ \leq \;\;&\left [w(i,j + 1) + B_{i,t-1} + \mathbf{B_{t,j+1}}\right ] + \left [w(i,j) + B_{i,t} + \mathbf{B_{t+1,j}}\right ] \\ =\;\;&L_{j+1,t}^{i} + L_{j,t+1}^{i} \end{array}$$

and thus L i is Monge (where the right hand side is \(\infty \) if \(t\not<j\)) and thus totally monotone.

Note that these decompositions immediately imply a new proof of Lemma 7 (Lemma 2.2 in [96]) which states that

$$K_{B}(i,j) \leq K_{B}(i,j + 1) \leq K_{B}(i + 1,j + 1).$$
(32)

To see this note that \(K_{B}(i,j + 1)\) is the location of the rightmost row-minimum of row i in matrix R j + 1, while \(K_{B}(i + 1,j + 1)\) is the location of the rightmost row-minimum of row i + 1 in matrix \({R}^{j+1}\). Thus, the definition of total monotonicity immediately gives

$$K_{B}(i,j + 1) \leq K_{B}(i + 1,j + 1).$$
(33)

Similarly, K B (i, j) is the rightmost row-minimum of row j in L i while K B (i, j + 1) is the location of the rightmost row-minimum of row j + 1 in L i. Thus

$$K_{B}(i,j) \leq K_{B}(i,j + 1).$$
(34)

Combining (33) and (34) yields (32). Since the actual speedup in the KY technique comes from an amortization argument based on (32), it follows that the original KY-speedup itself is also a consequence of total monotonicity.

Up to this point it is not clear how to actually calculate the B i, j using the R j and L i. Note first that even though the R j are totally monotone, their row minima cannot be calculated using the SMAWK algorithm. This is because, for \(0 \leq i < t \leq j\), the value of entry \(R_{i,t}^{j} = w(i,j) + B_{i,t-1} + B_{t,j}\), which is dependent upon \(B_{t,j}\) which is itself the row-minimum of row t in the same matrix R j. Thus, the values of the entries of R j depend upon other entries in R j which is something that SMAWK does not allow. The same problem occurs with the L i.

But despite this dependence, the LARSCH algorithm can still be used to find the row-minima of the R j. Recall that to execute the LARSCH algorithm one needs only that the matrix X satisfy the following conditions:

  1. X is an \(n \times m\) totally monotone matrix.

  2. For each row index i of X, there is a column index C i such that for j > C i , \(X_{i,j} = \infty \). Furthermore, \(C_{i} \leq C_{i+1}\).

  3. If \(j \leq C_{i}\), then X i, j can be evaluated in O(1) time provided that the row minima of the first i − 1 rows are already known.

If these conditions are satisfied, the LARSCH algorithm then calculates all of the row minima of X in O(n + m) time. This algorithm can now be used to derive

Lemma 11

  • Given that all values B i′,j, \(i < i' \leq j \leq n\) have already been calculated, all of the row-minima of L i can be calculated in O(n − i) time.

  • Given that all values B i,j′, \(0 \leq i \leq j' < j\) have already been calculated, all of the row-minima of R j can be calculated in O(j) time.

Proof

For the first part, it is easy to see that L i satisfies the first two conditions required by the LARSCH algorithm with C j = j. For the third condition, note that, for \(i < t \leq j\), \(L_{j,t}^{i} = w(i,j) + B_{i,t-1} + B_{t,j}\). The values w(i, j) and B t, j are already known and can be retrieved in O(1) time. B i, t − 1 is the minimum of row t − 1 of L i but, since we are assuming \(t \leq j\), this means that B i, t − 1 is the minimum of an earlier row in L i, and the third LARSCH condition is satisfied. Thus, all of the row-minima of the \((n - i) \times (n - i)\) matrix L i can be calculated in O(n − i) time.

For the second part set X to be the \((j + 1) \times (j + 1)\) matrix defined by \(X_{i,t} = R_{j-i,j-t}^{j}\). Then X satisfies the first two LARSCH conditions with C i = i − 1. For the third condition note that \(X_{i,t} = R_{j-i,j-t}^{j} = w(j - i,j) + B_{j-i,j-t-1} + B_{j-t,j}\). The values w(j − i, j) and \(B_{j-i,j-t-1}\) are already known and can be retrieved in O(1) time. B j − t, j is the row minimum of row t of X; but, since we are assuming \(t \leq C_{i} = i - 1\), this means that B j − t, j is the row minimum of an earlier row in X, so the third LARSCH condition is satisfied. Thus, all of the row-minima of X, and equivalently of R j , can be calculated in O(j) time.

Note that Lemma 11 immediately solves the “right-online” and “left-online” problems. Given the new values \(w(i,R + 1)\) for \(L \leq i \leq R + 1\), simply find the row minima of R R + 1 in time O(R − L). Given the new values w(L − 1, j) for \(L - 1 \leq j \leq R\), simply find the row minima of L L − 1. Therefore it was just shown that any dynamic programming problem for which the KY speedup can statically improve run time from O(n 3) to O(n 2) time can be solved in an online fashion in O(n) time per step. That is, online processing incurs no penalty compared to static processing. In particular, the optimum binary search tree can be maintained in O(n) time per step as nodes are added to both its left and right.

At this point note that decompositions L i could also be derived by careful cutting of the 3-D monotone matrices of Aggarwal and Park [2] along particular planes. Aggarwal and Park [2] used an algorithm of Wilber [94] (derived for finding the maxima of certain concave-sequences) to find various tube maxima of their matrices, leading to another O(n 2) algorithm for solving the KY-problem. In fact, even though their algorithm was presented as a static algorithm, careful decomposition of what they do permits using it to solve what is called here the left-online KY-problem. A symmetry argument could then yield a right-online algorithm. This never seems to have been noted in the literature, though.

8 Conclusion

Too small an academic community is aware of the many advanced tools available related to dynamic programming. Routinely, applications are solved by simple-minded dynamic programs, where much faster solutions are possible. In fact, for many massively large problems arising in Molecular Biology, for example, a quadratic solution might be completely useless, equivalent to no solution at all. It is the hope of the author that this book chapter will open up some of these advanced techniques to a larger community of scientists.

Likewise, the use of dynamic programming in online optimization, and by extension the concept of work functions, is not as widely embraced as is desirable. Many online algorithms are ad hoc, and work functions make it possible to “de-adhocify” the construction of competitive and efficient algorithms in this setting. The very recent concept of knowledge states (see [23]), which has dynamic programming at its core, makes it possible to derive online algorithms in a systematic way even in the randomized case.

Readers are invited to go beyond the obvious when using dynamic programming and to avail themselves of these powerful techniques.

9 Further Reading

Introductions to dynamic programming with numerous elementary examples can be found in the textbooks by Brassard and Bratley [32], Cormen, Leiserson, Rivest, and Stein [45], Baase and van Gelder [10], and Dasgupta, Papadimitriou, and Vazirani [47].

The classical treatises by Bellman [24] and [25] are still relevant today though the language is somewhat antiquated and the applications outdated. Other general reading is Dreyfus and Law [50], Stokey, Lucas, and Prescott [88], Bertsekas [28], Denardo [48], Meyn [83], and more recently Sniedovich [87]. One classical algorithm worth mentioning is the algorithm by Viterbi [90] for Hidden Markov Model inference. Many problems in areas such as digital communications can be cast in the Hidden Markov Model. Another classic is Hu and Tucker [65] on alphabetic trees. Bellmore and Nemhauser [27] as well as Lawler, Lenstra, Rinnooy Kan, and Shmoys [79] and Burkard, Deineko, van Dal, van der Veen, and Woeginger [39] contain material regarding the use of dynamic programming for the traveling salesman problem. Gusfield [60] is a good general resource on dynamic programming for computational biology. Examples with regard to scheduling can be found in Bellman, Esogbue, and Nabeshima [26] as well as in Brucker [33] and in Brucker and Knust [35]. David Eppstein, Zvi Galil, Raffaele Giancarlo, and Giuseppe F. Italiano [52, 53] give a two-paper sequence in the Journal of the ACM on sparse dynamic programming.

Regarding online algorithms (discussed in Sect. 3) the book by Borodin and El-Yaniv [30] is the standard text. Of note is also the monograph by Karlin [66]. The article by Larmore and Chrobak [42], which introduced work functions, extends the material in Sect. 3. For further immersion into the realm of work functions the article by Bein, Chrobak, and Larmore [16] is recommended. Recent work by Bein, Larmore, Noga, and Reischuk [23] on knowledge states has extended the use of work functions to randomized online algorithms: such algorithms are “guided” in their operation by work functions.

Some of the material presented in Sect. 4 can be found in greater detail in Bein, Larmore, Morales, and Sudborough [22]. Further reading regarding this section: Sorting problems under various operations have been studied extensively, including work on sorting with prefix reversals by Gates and Papadimitriou [58] (Gates of Microsoft fame) as well as Heydari and Sudborough [62], transpositions [11], and block moves ([17, 80] as well as Mahajan, Rama, and Vijayakumar [81]).

Some of the material presented in Sect. 5 can be found in greater detail in Bein, Noga, and Wiegley [19]. The paper by Albers and Brucker [6] contains an alternate linear time algorithm for list batching. There is a large body of work on dynamic programming and batching; see the work of Baptiste [12], Baptiste and Jouglet [13], Brucker, Gladky, Hoogeveen, Kovalyov, Potts, Tautenhahn, and van de Velde [36], Brucker, Kovalyov, Shafransky, and Werner [37], Hoogeveen and Vestjens [64], as well as the scheduling textbook by Peter Brucker [33]. Batching has wide application in manufacturing (see, e.g., [34, 85, 98]), decision management (see, e.g., [74]), and scheduling in information technology (see, e.g., [46]). More recent work on online batching is related to the TCP (Transmission Control Protocol) acknowledgment problem (see [20, 49, 67]).

Regarding Monge properties and total monotonicity the best resource is Park’s thesis, [84]. Another excellent survey on Monge properties is Burkard, Klinz and Rudolf [38]. A generalization of the Monge property to an algebraic property is in Bein, Brucker, Larmore, and Park [18]. Interesting applications are in Woeginger [95]. Burkard, Deineko, and Woeginger [40] survey the traveling salesman problem on Monge matrices. Agarwal and Sen [1] give results for selection in monotone matrices for computing kth nearest neighbors. Schieber [86] gives results on k-link paths in graphs with Monge properties.

Section 6 is based on Aggarwal, Klawe, Moran, Shor, and Wilber [3] and Larmore and Schieber [78]. Generalizations of the matrix searching techniques are in Klawe [69] and Klawe and Kleitman [70], as well as in Kravets and Park [73], Aggarwal and Park [5] (parallel searching), and Wilber [94]. A good survey paper is Galil and Park [57].

An extended version of the material of Sect. 7 can be found in Bein, Larmore, Golin, and Zhang [21]. For Sect. 7 the papers of Knuth [71] followed by Yao [97] are key. Aggarwal and Park used an algorithm of Wilber [94] to find various tube maxima of their matrices, leading to another O(n 2) algorithm for solving the KY-problem. There are two extensions of the Knuth-Yao quadrangle inequality: the first is due to Wachs [91] and the second to Borchers and Gupta (BG) [29]. This is discussed in detail in Bein, Larmore, Golin, and Zhang [21]. Relatedly, Aggarwal, Bar-Noy, Khuller, Kravets, and Schieber [4] describe solutions for matching using the quadrangle inequality.

The main gist of this chapter is dynamic programming speedup: Applications abound. Classic examples include the work by Hirschberg and Larmore [63] on the least weight subsequence problem, Larmore and Przytycka [76] on parallel construction of trees with optimal path length, and Larmore and Hirschberg [75] on length-limited coding. David Eppstein [51] considers sequence comparisons. Apostolico, Atallah, Larmore, and McFaddin [7] give algorithms for string editing. Highly recommended is the paper by Galil and Giancarlo [56] on speeding up dynamic programming in Molecular Biology. Work by Arslan and Egecioglu [8] is on sequence alignment. Bradford, Golin, Larmore, and Rytter [31] give dynamic programs for optimal prefix-free codes. Recent speedup work is for the online maintenance of k-medians by Fleischer, Golin, and Zhang [54] (see also [61, 89]).

10 Cross-References

Advances in Scheduling Problems

Computing Distances Between Evolutionary Trees

Efficient Algorithms for Geometric Shortest Path Query Problems

Geometric Optimization in Wireless Networks

Online and Semi-online Scheduling

Resource Allocation Problems