1 Introduction

A novel L 1-based analysis of the Iterative Proportional Fitting (IPF) procedure is presented. The IPF procedure is an algorithm for scaling rows and columns of an input k×ℓ weight matrix A=((a ij )) so that the output matrix B=((b ij )) achieves row sums equal to a prespecified vector of row marginals, r=(r 1,…,r k ), and column sums equal to a prespecified vector of column marginals, s=(s 1,…,s ℓ ). We assume that all weights are nonnegative, a ij ≥0, and that at least one entry in each row and column of A is positive. All marginals are taken to be positive, r i >0 and s j >0.

The problem family has a continuous member, the biproportional fitting problem, and a discrete member, the biproportional apportionment problem. The continuous case allows the entries of B to be nonnegative, b ij ∈[0,∞). The output B is called the biproportional fit of the weight matrix A to the target marginals r and s. The IPF procedure produces scaled matrices A(t)=((a ij (t))) with matching row sums for odd steps, a i+(t+1)=r i for all i≤k, and matching column sums for even steps, a +j (t+2)=s j for all j≤ℓ. Throughout this paper a subscript plus-sign indicates summation over the index concerned, such as a i+=∑ j a ij etc. If the biproportional fit B exists, the sequence of scaled matrices A(t), t≥0, converges to B.

In the discrete case the entries of B are restricted to be natural numbers, b ij ∈{0,1,2,…}, as are the marginal components r i and s j . Now the output B is called a biproportional apportionment, for the weight matrix A and the target marginals r and s. The discrete problem may be solved using the Alternating Scaling (AS) algorithm. It produces scaled matrices A(t), but rounds the entries a ij (t) to integers. There are rare instances when the biproportional apportionment B exists while the AS algorithm fails to converge to it. An example is given by Gaffke and Pukelsheim (2008, p. 157).

The present paper focuses on the continuous fitting problem where the relevant terms r i ,s j ,a ij (t),b ij are not restricted to be integers. Yet our major tool, the L 1-error function

$$ f \bigl( A(t) \bigr) = \sum_{i \le k} \bigl|a_{i+}(t) - r_i \bigr| + \sum_{j \le \ell} \bigl| a_{+j}(t) - s_j \bigr|, $$

is borrowed from Balinski and Demange’s (1989a, 1989b) inquiry into the discrete apportionment problem. In the discrete case the error function is quite suggestive, simply counting how many units are wrongly allocated in step t. For the continuous problem the L 1-error is, at first glance, just one out of many ways to assess a lack of fit. At second glance it is a most appropriate way, as this paper endeavors to show.

1.1 The literature on biproportional fitting

The continuous biproportional fitting problem is the senior member of the problem family. It has created an enormous body of literature; we review only the papers that influenced the present research. The term IPF procedure prevails in Statistics, see Fienberg and Meyer (2006), or Speed (2005). Some Statisticians speak of matrix raking, such as Fagan and Greenberg (1987). In Operations Research and Econometrics the label RAS method is popular, pointing to a (diagonal) matrix R of row multipliers, the weight matrix A, and a (diagonal) matrix S of column multipliers, as mentioned already by Bacharach (1965, 1970). Computer scientists prefer the term matrix scaling, as in Rote and Zachariasen (2007).

The IPF procedure became popular through Deming and Stephan (1940), see Fienberg and Meyer (2006). Deming and Stephan (1940, p. 440) recommend terminating iterations when the table reproduces itself. This closeness is what is measured by the L 1-error function f(A(t)), see the display preceding Lemma 2 in Sect. 3. While successfully advocating the merits of the algorithm, Deming and Stephan were somewhat led astray in its analysis, as communicated by Stephan (1942).

Brown (1959) proposes a convergence proof that Ireland and Kullback (1968) criticize as lacking rigor. The latter authors establish convergence by relating the IPF procedure to the minimum entropy solution. Csiszár (1975, p. 155) notes that their argument is flawed, and that the generalization to measure spaces by Kullback (1968) suffers from a similar deficiency. Csiszár (1975) salvages the entropy approach, and Rüschendorf (1995) extends it to general measure spaces. Rüschendorf and Thomsen (1993, 1997) rectify a technical detail that escaped Csiszár’s (1975) attention.

Ireland and Kullback (1968, Eqs. (4⋅32) and (4⋅33)) ultimately replace convergence of entropy by convergence in L 1, referring to a result of Kullback (1966). Bregman (1967) also starts out with entropy, and then uses the L 1-error function. Here we focus on the L 1-approach from start to finish. Ireland and Kullback (1968, p. 184) prove that the entropy criterion decreases monotonically, as does the likelihood function of Bishop et al. (1975, p. 86), and the L 1-error function, see Bregman (1967, p. 197). Marshall and Olkin (1968) and Macgill (1977) minimize a quadratic objective function. Pretzel (1980) uses a geometric matrix mean and makes do with the arithmetic-geometric-mean inequality. The computational complexity of the IPF procedure is investigated by Kalantari et al. (2008).

The existence question is studied by many authors, such as Brualdi et al. (1966), Rothblum and Schneider (1989), Schneider (1990), Brown et al. (1993), Hershkowitz et al. (1997). Some papers employ network and graph theory by interpreting the problem as a transportation problem, as reviewed by Pukelsheim et al. (2012). Since this author was unable to smooth the arguments used here by employing graph-theoretic tools, the present paper’s language is that of linear algebra.

Fienberg (1970) opens up a different route by embedding the IPF procedure into the geometry of the manifold of constant interaction in a (kℓ−1)-dimensional simplex of reference. The author works with the assumption that all input weights are positive, a ij >0. He points out (p. 915) that the extension to problems involving zero weights is quite complex, which is attested to by much of the literature. Ireland and Kullback’s (1968, p. 182) plea of assuming positive weights in order to simplify the argument is a friendly understatement, unless it is meant to be the utter truth.

Yet another approach, staying close to calculus and linear algebra, is due to Bacharach (1965, 1970), and Sinkhorn (1964, 1966, 1967, 1972, 1974) and Sinkhorn and Knopp (1967). Our viewpoint follows their lead. Michael Owen Leslie Bacharach (b. 1936, d. 2002) was an Oxford econometrician. In 1965 he earned a PhD degree in Mathematics from Cambridge. His thesis was published as Bacharach (1965), and became Sect. 4 of Bacharach (1970). Richard Dennis Sinkhorn (b. 1934, d. 1995) received his Mathematics PhD in 1962 from the University of Wisconsin–Madison, with a thesis entitled On Two Problems Concerning Doubly Stochastic Matrices. Throughout his career he served as a Mathematics professor with the University of Houston. Though contemporaries, neither of the two ever quoted the other.

1.2 The literature on biproportional apportionment

The discrete biproportional apportionment problem is the junior member of the problem family. It was put forward first by Balinski and Demange (1989a, 1989b), see also Simeone and Pukelsheim (2006). The rounding of scaled quantities to integers is appropriate for the statistical analysis of frequency tables, as noted by Wainer (1998) and Pukelsheim (1998). It disposes of any disclaimer that the adjusted figures are rounded off, hence when summed may occasionally disagree a unit or so, as warned in Table I of Deming and Stephan (1940, p. 433). When calculating percentages, as in Table 3.6-4 of Bishop et al. (1975, p. 99), the method finishes off with 100 percent and does not stop short with 99 percent. Yet Balinski’s motivation was not contingency table analysis in statistics, but proportional representation systems for parliamentary elections.

The task of allocating seats of a parliamentary body to political parties does not tolerate any disclaimer excusing residual rounding errors. Methods must account for every seat. This is achieved by biproportional methods. In 2003, the Swiss Canton of Zurich adopted a doubly proportional system, the biproportional divisor method with standard rounding, see Pukelsheim and Schuhmacher (2004, 2011), and Balinski and Pukelsheim (2006). The method may be attractive also for other countries, as studied by Pennisi (2006) for Italy, Zachariassen and Zachariasen (2006) for the Faroe Islands, Ramírez et al. (2008) for Spain, and Oelbermann and Pukelsheim (2011) for the European Union. A computer program for the calculations is provided at www.uni-augsburg.de/bazi, see Pukelsheim (2004), Joas (2005), Maier (2009). The user may choose to run the IPF procedure, the AS algorithm, the Tie-and-Transfer (TT) algorithm of Balinski and Demange (1989b), or various hybrid combinations. The performance of these algorithms is studied by Maier et al. (2010).

In the electoral application the entries a ij in the input weight matrix A signify vote counts. When a party j does not campaign in district i it enters into the final evaluations with vote count a ij =0. Therefore weights that are zero must be properly dealt with, even if the labor entailed becomes quite complex. It is no longer appropriate to assume all weights to be positive in order to simplify the argument.

1.3 Section overview

A brief overview of the paper is as follows. Section 2 investigates biproportional scalings of a given weight matrix A. A biproportional scaling reweighs rows and columns of A moderately enough so that none of the rows nor columns is annihilated. If two scalings share the same row and column sums, then they coincide (Theorem 1). A scaling is called direct when a passage to the limit, though allowed by definition, becomes superfluous. In such a case the input matrix A and the output matrix B decompose accordingly (Lemma 1), see also Balinski and Demange (1989a), Gietl (2009). Theorem 2 establishes necessary and sufficient conditions to check for directness.

Section 3 turns to the IPF procedure by adjoining the goal to match prespecified row marginals r and prespecified column marginals s. In each step t a scaled weight matrix A(t) is produced that has matching rows for odd t, and matching columns for even t. The fit of A(t) is measured by the L 1-error function mentioned above. Lemma 2 ascertains that the L 1-error is non-increasing, and admits a finite set of lower bounds. The largest lower bound turns out to be equal to the L 1-error limit. The L 1-error limit formula is new. Its proof makes use of the result that the IPF even-step subsequence is always convergent, as shown by Gietl and Reffel (2013) on the basis of the information divergence approach of Csiszár and Tusnády (1984). The result replaces the IPF conjecture on which a previous version of the present paper relied, see Pukelsheim (2013). Based on the limiting L 1-error formula, Theorem 3 offers a brief proof of the necessary and sufficient conditions for the IPF sequence to converge to the biproportional fit.

2 Biproportional scalings

Let A=((a ij )) be a given k×ℓ weight matrix, that is, A is assumed to have nonnegative entries and no zero row nor zero column. Whether rows i≤k or columns j≤ℓ do not vanish is conveniently read off from their component sums, a i+>0 and a +j >0. Another weight matrix B is said to preserve the zeros of A when all zeros of A are zeros also of B, a ij =0⇒b ij =0. Two matrices A and B have the same zeros when a ij =0⇔b ij =0. We concentrate on true matrix problems, k≥2 and ℓ≥2.

Definition 1

A k×ℓ weight matrix B=((b ij )) is called a biproportional scaling of A when for all rows i≤k and for all columns j≤ℓ there are sequences of positive row divisors ρ i (1),ρ i (2),… and of positive column divisors σ j (1),σ j (2),… such that

$$ b_{ij} = \lim_{n \to \infty} { a_{ij} \over \rho_i(n)\sigma_j(n) }. $$

A biproportional scaling B is termed direct when its associated divisor sequences can be chosen to be constant, that is, for all rows i≤k and for all columns j≤ℓ there are positive divisors μ i and ν j such that b ij =a ij /(μ i ν j ).
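
For illustration, the following minimal sketch computes a direct scaling from given divisors; the matrix A and the divisors μ, ν are hypothetical numbers chosen by us, not taken from the paper.

```python
import numpy as np

# Minimal sketch of a *direct* biproportional scaling, b_ij = a_ij / (mu_i * nu_j).
# The weight matrix A and the divisors mu, nu below are illustrative choices only.
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])          # nonnegative, no zero row or column
mu = np.array([2.0, 1.0])           # positive row divisors
nu = np.array([0.5, 3.0])           # positive column divisors

B = A / np.outer(mu, nu)            # direct scaling; the zeros of A persist in B
print(B)                            # [[2.0, 0.1667], [0.0, 1.0]]
```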

The requirement that B is a weight matrix necessitates positive row and column sums and imposes a certain balance on row and column divisors. Neither family of divisors can grow unboundedly at an order dwarfing the other, lest a vanishing row or column invalidate the definition. Furthermore the requirement implies uniqueness of biproportional scalings that share the same marginals.

Theorem 1

(Uniqueness)

If two biproportional scalings B and C of a weight matrix A share the same row and column sums, b i+=c i+ for all rows i≤k and b +j =c +j for all columns j≤ℓ, then they coincide, B=C.

Proof

The proof is by contraposition. Assuming the two scalings to be distinct, B≠C, the difference B−C is nonzero, but has vanishing row and column sums. We construct a cycle of cells

$$ (i_1,j_1), (i_1,j_2), (i_2,j_2), (i_2,j_3), \ldots, (i_{q-1},j_{q-1}), (i_{q-1},j_q), (i_q,j_q), (i_q,j_1) $$
(CC)

along which the entries in B−C alternate in sign. First we assemble a “long list” of cells (i 1,j 1),(i 1,j 2),(i 2,j 2),…,(i Q ,j Q ),(i Q ,j 1), as follows. We start with a cell (i 1,j 1) where \(b_{i_{1}j_{1}} > c_{i_{1}j_{1}}\). In row i 1 there is a cell (i 1,j 2) with \(b_{i_{1}j_{2}} < c_{i_{1}j_{2}}\). Next we search column j 2 for a row i 2 where \(b_{i_{2}j_{2}} > c_{i_{2}j_{2}}\). Then we look for a column j 3 such that \(b_{i_{2}j_{3}} < c_{i_{2}j_{3}}\). The long list terminates when encountering a column j Q already listed, that is, when for some P>Q we find j P =j Q . The initial P−1 cells are discarded, and the remaining “short list” is relabeled as in (CC).

A cyclic ratio is a ratio in which the entries along a given cell cycle appear alternately in the numerator and in the denominator. Since a ij =0 implies b ij =c ij =0, the cell cycle (CC) involves only positive entries of the weight matrix A. Let ρ i (n) and σ j (n) be the divisor sequences for B, and μ i (n) and ν j (n) for C. Since biproportionality preserves cyclic ratios, the cyclic ratios in A, B, and C are seen to be equal,

$$ \prod_{p\le q} { a_{i_pj_p} \over a_{i_pj_{p+1}} } =\prod _{p\le q} {\ { a_{i_pj_p}\over \rho_{i_p}(n)\sigma_{j_p}(n)} \ \over {a_{i_pj_{p+1}}\over \rho_{i_p}(n)\sigma_{j_{p+1}}(n)} } =\prod _{p\le q} { b_{i_pj_p} \over b_{i_pj_{p+1}}} =\prod _{p\le q} {\ {a_{i_pj_p}\over \mu_{i_p}(n)\nu_{j_p}(n)} \ \over {a_{i_pj_{p+1}}\over \mu_{i_p}(n)\nu_{j_{p+1}}(n)}} =\prod _{p\le q} { c_{i_pj_p} \over c_{i_pj_{p+1}} }, $$

where j q+1 =j 1 . The first and third equality signs are obvious. The fourth equality involves a passage to the limit as n tends to ∞ and holds since the limiting denominator is positive, \(c_{i_{p}j_{p+1}} > b_{i_{p}j_{p+1}} \ge 0\). As the quotient is positive, the numerator must be positive, too, \(c_{i_{p}j_{p}} > 0\). A similar argument establishes the second equality.

But the cycle (CC) precludes equality, \(\prod_{p\le q} b_{i_{p}j_{p}}/b_{i_{p}j_{p+1}} > \prod_{p\le q} c_{i_{p}j_{p}}/c_{i_{p}j_{p+1}}\). Hence the assumption B≠C is untenable and uniqueness obtains, B=C. □

Directness of a biproportional scaling is closely related to the notion of connectedness. It is convenient, however, to define disconnectedness first. A nonzero matrix D is called disconnected when a suitable permutation of rows and a suitable permutation of columns give rise to a row subset I and a column subset J such that D acquires block structure,

$$ D = \bordermatrix{ &J&J'\cr I&D^{(1)}&0\cr I'&0&D^{(2)}\cr}, $$

where at least one of the subsets I or J is non-empty and proper, ∅⫋I⫋{1,…,k} or ∅⫋J⫋{1,…,ℓ}. Throughout this paper a prime denotes a set complement, such as I′={1,…,k}∖I etc. In most applications both subsets I and J are non-empty and proper. A nonzero matrix C is said to be connected when it is not disconnected.

For keeping track of the nonzero entries in a weight matrix A we associate with every row subset I⊆{1,…,k} the set of columns connected in A with I,

$$ J_A(I) = \{ j \le \ell \mid a_{ij} > 0\ \hbox{for some}\ i \in I \}. $$

The complement J A (I)′ embraces the columns j with entries a ij =0 for all i∈I. Hence the I×J A (I)′ submatrix of A vanishes and the sum of its entries is zero, \(a_{I \times J_{A}(I)'} = 0\). The extreme settings provide simple examples. If we choose I={1,…,k} then we get J A (I)={1,…,ℓ}, since no row nor column of A vanishes. If I=∅ then J A (I)=∅.

The term “connectedness” is used in the same way that is familiar from graph theory. The matrix A is connected, in the sense of linear algebra as defined above, if and only if the bipartite graph with rows and columns as vertices and with nonzero entries in A as edges is connected, in the sense of graph theory.
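
As a concrete counterpart to these definitions, here is a small Python sketch (the function names and 0-based indexing are ours, not the paper's) that computes J A (I) and tests connectedness of A by breadth-first search on the bipartite graph just described.

```python
import numpy as np
from collections import deque

def J(A, I):
    """J_A(I): the columns j with a_ij > 0 for some row i in I."""
    return {j for j in range(A.shape[1]) if any(A[i, j] > 0 for i in I)}

def is_connected(A):
    """Connectedness of A, via BFS on the bipartite row/column graph."""
    k, l = A.shape
    seen_rows, seen_cols = {0}, set()
    queue = deque([("row", 0)])              # start the search at row 0
    while queue:
        kind, x = queue.popleft()
        if kind == "row":                    # visit all columns sharing a positive entry
            for j in range(l):
                if A[x, j] > 0 and j not in seen_cols:
                    seen_cols.add(j); queue.append(("col", j))
        else:                                # visit all rows sharing a positive entry
            for i in range(k):
                if A[i, x] > 0 and i not in seen_rows:
                    seen_rows.add(i); queue.append(("row", i))
    return len(seen_rows) == k and len(seen_cols) == l
```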

When the input weight matrix A is disconnected, the calculation of a biproportional scaling B decomposes into several separate instances. Hence there is no loss of generality in assuming A to be connected. An ensuing biproportional scaling B may still turn out to be disconnected. If so, the pattern of zeros in the output matrix B has repercussions on the pattern of zeros in the input matrix A, as follows.

Lemma 1

(Joint decomposition)

For every connected weight matrix A and for every disconnected biproportional scaling B of A there exists a non-empty and proper subset I of rows, ∅⫋I⫋{1,…,k}, such that A acquires block-triangular structure and B acquires block-diagonal structure,

$$ A = \bordermatrix{ &J_A(I)&J_A(I)'\cr I&A^{(1)}&0\cr I'&A^{(2,1)}&A^{(2)}\cr}, \qquad B = \bordermatrix{ &J_B(I)&J_B(I)'\cr I&B^{(1)}&0\cr I'&0&B^{(2)}\cr}, $$

where the sets of columns connected in A or B with I are equal, J A (I)=J B (I).

Proof

Without loss of generality we assume the largest accumulation point of the column divisors to be positive, lim sup n→∞ σ max(n)=M∈(0,∞), where σ max(n)=max{σ 1(n),…,σ ℓ (n)}. If need be, we would adjust the divisors according to \(\widetilde{\rho}_{i}(n) = \rho_{i}(n) \sigma_{\max}(n)\) and \(\widetilde{\sigma}_{j}(n) = \sigma_{j}(n)/\sigma_{\max}(n) \le 1\), and use M=1. Let an arrow → indicate a passage to the limit as n tends to infinity. The set I is defined to contain the rows with divisors not degenerating, in the sense of not diverging to infinity,

$$ I = \bigl\{ i \le k \mid \rho_i(n) \not \to \infty \bigr\}, \qquad I' = \bigl\{ i \le k \mid \rho_i(n) \to \infty \bigr\}. $$

Likewise the columns connected in B with I will turn out to have their divisors not degenerating, but now in the sense of not converging to zero,

$$ J_B(I) = \bigl\{ j \le \ell \mid \sigma_j(n) \not \to 0 \bigr\}, \qquad J_B(I)' = \bigl\{ j \le \ell \mid \sigma_j(n) \to 0 \bigr\}. $$

With A connected and B disconnected there exists a cell (i,j) that is fading, a ij >0=b ij . This implies lim n→∞ ρ i (n)σ j (n)=∞ and, since column divisors stay bounded by assumption, lim n→∞ ρ i (n)=∞. Hence I′ is not empty, nor is J B (I′)—which in the end will turn out to coincide with J B (I)′. By definition of J B (I′), the I′×J B (I′)′ block of B is zero.

The columns j∈J B (I′) have their divisors converge to zero, lim n→∞ σ j (n)=0. Indeed, there exists a row i∈I′ with b ij >0. In this cell we have lim n→∞ ρ i (n)σ j (n)=a ij /b ij <∞. Since the divisors of row i∈I′ diverge to infinity, column j has its divisors converge to zero.

By assumption the column divisors satisfy σ max(n)>M/2 infinitely often. Thus there exists a column j with divisors fulfilling σ j (n)>M/2 again and again and not converging to zero. Hence J B (I′)′ is not empty, nor is I.

Now every column j∈J B (I′)′ has its divisors not converging to zero, lim sup n→∞ σ j (n)>0. Indeed, there is a row i∈I with b ij >0. In this cell we have lim n→∞ ρ i (n)σ j (n)=a ij /b ij >0. Since column j admits a divisor subsequence bounded away from zero, the row divisors that go along cannot diverge to infinity.

The I×J B (I′) top right block of A has a ij =0. Indeed, the case a ij >0 would admit a row divisor subsequence bounded from above, in the presence of column divisors converging to zero. But a ij >0 and lim n→∞ ρ i (n)σ j (n)=0 lead to the contradiction b ij =∞. The top right block of B inherits the zeros of A. The structures of B and A entail J B (I′)′=J B (I)=J A (I). □

Given a biproportional scaling B, there are various ways to check for directness.

Theorem 2

(Directness)

For every connected weight matrix A and for every biproportional scaling B of A the following five statements are equivalent:

  (a) The biproportional scaling B is direct.

  (b) The matrices A and B have the same zeros.

  (c) There exists a weight matrix C sharing the same zeros with A and the same row and column sums with B.

  (d) For every non-empty and proper subset I of rows, ∅⫋I⫋{1,…,k}, partial row and column sums of B fulfill \(\sum_{i \in I} b_{i+} < \sum_{j \in J_{A}(I)} b_{+j}\).

  (e) The matrix B is connected.

Proof

(a) ⇒ (b). A direct scaling, b ij =a ij /(μ i ν j ), has the same zeros as has A.

(b) ⇒ (c). The scaling B, sharing all zeros with A, is of the type asked for in (c).

(c) ⇒ (d). For every row subset I we have \(a_{I \times J_{A}(I)'} = 0\), and hence \(c_{I \times J_{A}(I)'} = 0\). If I is non-empty and proper then \(c_{I' \times J_{A}(I)} > 0\), as otherwise C would be disconnected, and so would A. We get \(\sum_{i \in I} b_{i+} = c_{I \times J_{A}(I)} < c_{I \times J_{A}(I)} + c_{I' \times J_{A}(I)} = \sum_{j \in J_{A}(I)} b_{+j}\).

(d) ⇒ (e). The proof is by contraposition. If B is disconnected, then Lemma 1 provides a non-empty and proper row set I fulfilling \(\sum_{i \in I} b_{i+} = b_{I \times J_{A}(I)} = \sum_{j \in J_{A}(I)} b_{+j}\).

(e) ⇒ (a). Row divisors μ i and column divisors ν j for B are constructed in the course of a scanning process. The process is initialized by standardizing the given divisor sequences according to \(\widetilde{\rho}_{i}(n) = \rho_{i}(n) / \rho_{1}(n)\) and \(\widetilde{\sigma}_{j}(n) = \rho_{1}(n) \sigma_{j}(n)\), thus equipping the first row with constant divisor unity, \(\widetilde{\rho}_{1}(n) = 1 = \mu_{1}\), n≥1. Then the process scans all columns j with b 1j >0, and sets

$$ 0 < \nu_j = {a_{1j} \over \mu_1 b_{1j}} = {\lim_{n \to \infty} \widetilde{\rho}_1(n) \widetilde{\sigma}_j(n) \over \lim_{n \to \infty} \widetilde{\rho}_1(n)} = \lim_{n \to \infty} \widetilde{\sigma}_j(n), \quad \mbox{whence}\ b_{1j} = {a_{1j} \over \mu_1 \nu_j}. $$

Next all unscanned rows i with b ij >0 for some scanned column j are scanned, setting

$$ 0 < \mu_i = {a_{ij} \over b_{ij} \nu_j} = {\lim_{n \to \infty} \widetilde{\rho}_i(n) \widetilde{\sigma}_j(n) \over \lim_{n \to \infty} \widetilde{\sigma}_j(n)} = \lim_{n \to \infty} \widetilde{\rho}_i(n), \quad \mbox{whence}\ b_{ij} = {a_{ij} \over \mu_i \nu_j}. $$

Thereafter the process turns to columns again, then rows. In this fashion it keeps enlarging the scanned sets of rows and columns, terminating after at most k+ℓ steps. The terminal scanned row set I and column set J enforce a block structure upon B,

$$ B = \bordermatrix{ &J&J'\cr I&B^{(1)}&0\cr I'&0&B^{(2)}\cr}. $$

Connectedness of B lets the scanned sets be exhaustive, I={1,…,k} and J={1,…,ℓ}. All rows and all columns having constant divisors, the scaling is direct. □
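
The scanning process lends itself to a direct implementation. The sketch below (our own naming, 0-based indexing, with row 0 playing the role of the first row) recovers constant divisors μ and ν from a connected scaling B of A by the same alternating sweep over rows and columns.

```python
import numpy as np
from collections import deque

def scan_divisors(A, B):
    """Sketch of the scanning process in (e) => (a): given a connected
    biproportional scaling B of A, recover divisors mu, nu with
    b_ij = a_ij / (mu_i * nu_j), starting from mu_0 = 1."""
    k, l = A.shape
    mu, nu = [None] * k, [None] * l
    mu[0] = 1.0                                   # standardize: first row divisor is unity
    queue = deque([("row", 0)])
    while queue:
        kind, x = queue.popleft()
        if kind == "row":                         # scan columns j with b_xj > 0
            for j in range(l):
                if B[x, j] > 0 and nu[j] is None:
                    nu[j] = A[x, j] / (mu[x] * B[x, j])   # nu_j = a_ij / (mu_i b_ij)
                    queue.append(("col", j))
        else:                                     # scan rows i with b_ix > 0
            for i in range(k):
                if B[i, x] > 0 and mu[i] is None:
                    mu[i] = A[i, x] / (B[i, x] * nu[x])   # mu_i = a_ij / (b_ij nu_j)
                    queue.append(("row", i))
    return mu, nu    # connectedness of B guarantees that every divisor gets set
```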

We now turn to the Iterative Proportional Fitting (IPF) procedure proper.

3 The IPF procedure

The IPF procedure seeks to calculate a biproportional scaling B that achieves prespecified row marginals r and prespecified column marginals s. Let A be a given k×ℓ weight matrix. Furthermore let r=(r 1,…,r k ) and s=(s 1,…,s ℓ ) be given vectors with positive entries, called target row marginals and target column marginals.

Definition

A k×ℓ nonnegative matrix B=((b ij )) is said to match the target marginals r and s when its row sums are equal to r and its column sums are equal to s, that is, b i+=r i for all i≤k and b +j =s j for all j≤ℓ. A weight matrix B is called a biproportional fit of the weight matrix A to the target marginals r and s when B is a biproportional scaling of A and B matches the target marginals r and s.

With this terminology the IPF procedure aims to determine a biproportional fit of the weight matrix A to the target marginals r and s. If a fit exists then it is unique, by Theorem 1. An existing fit B necessitates equal totals of the marginals, r +=b ++=s +. We do not assume equality of the totals, though, since the IPF procedure is well-defined without this assumption and since distinct marginal subtotals may evolve when the procedure leads to accumulation points that are disconnected.

The IPF procedure starts by scaling A into a matrix A(0) with column sums equal to the target column marginals. That is, with column divisors β j (0)=a +j /s j the entries of A(0) are defined to be a ij (0)=a ij /β j (0), for all i≤k and j≤ℓ. Thereafter the procedure advances in pairs of an odd step t+1 and an even step t+2, for t=0,2,… :

  • Odd steps t+1 fit row sums to target row marginals by calculating row divisors α i (t+1) from the preceding even step t, and scaled weights a ij (t+1):

    $$\begin{aligned} \alpha_i(t+1) &= {a_{i+}(t) \over r_i}, \end{aligned}$$
    (IPF1)
    $$\begin{aligned} a_{ij}(t+1) &= {a_{ij}(t) \over \alpha_i(t+1)}, \end{aligned}$$
    (IPF2)

    for all rows i≤k and for all columns j≤ℓ.

  • Even steps t+2 fit column sums to target column marginals by calculating column divisors β j (t+2) from the preceding odd step t+1, and scaled weights a ij (t+2):

    $$\begin{aligned} \beta_j(t+2) &= {a_{+j}(t+1) \over s_j}, \end{aligned}$$
    (IPF3)
    $$\begin{aligned} a_{ij}(t+2) &= {a_{ij}(t+1) \over \beta_j(t+2)}, \end{aligned}$$
    (IPF4)

    for all columns j≤ℓ and for all rows i≤k.

The incremental row divisors α i (1),α i (3),… from (IPF1) give rise to cumulative row divisors ρ i , and the incremental column divisors β j (2),β j (4),… from (IPF3) generate cumulative column divisors σ j . They are defined, for steps t=0,2,… , through

$$ \begin{aligned} &\alpha_i(1) \, \alpha_i(3) \, \cdots \, \alpha_i(t+1) = \rho_i(t+1) = \rho_i(t+2), \\ &\beta_j(0) \, \beta_j(2) \, \beta_j(4) \, \cdots \, \beta_j(t+2) = \sigma_j(t+2) = \sigma_j(t+3). \end{aligned} $$

Adjoining ρ i (0)=1 and σ j (0)=σ j (1)=β j (0), cumulative divisors are defined for all steps t≥0. The scaled weights take the form a ij (t)=a ij /(ρ i (t)σ j (t)).
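
In matrix form, the steps (IPF1)–(IPF4) amount to alternately dividing rows and columns by their current sum-to-target ratios. The following Python sketch is a straightforward transcription of these definitions, not an optimized implementation; it runs the procedure for a fixed number of step pairs.

```python
import numpy as np

def ipf(A, r, s, pairs=1000):
    """Run the IPF procedure: step 0 fits columns, then odd/even step pairs follow."""
    X = np.asarray(A, dtype=float)
    r = np.asarray(r, dtype=float)
    s = np.asarray(s, dtype=float)
    X = X / (X.sum(axis=0) / s)                  # step 0: divisors beta_j(0) = a_{+j}/s_j
    for _ in range(pairs):
        X = X / (X.sum(axis=1) / r)[:, None]     # odd step (IPF1)-(IPF2): fit row sums to r
        X = X / (X.sum(axis=0) / s)[None, :]     # even step (IPF3)-(IPF4): fit column sums to s
    return X
```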

The scaled weight matrices A(t)=((a ij (t))), t≥0, constitute the IPF sequence, for the fitting of the weight matrix A to the target marginals r and s. An L 1-error function f is employed to assess the goodness-of-fit of a scaled weight matrix A(t). It checks whether any row is underfitted, a i+(t)<r i , or overfitted, a i+(t)>r i , as well as whether any column is under- or overfitted, and then totals the absolute deviations between current sums and target marginals,

$$ f \bigl( A(t) \bigr) = \sum_{i \le k} \bigl|a_{i+}(t) - r_i \bigr| + \sum_{j \le \ell} \bigl| a_{+j}(t) - s_j \bigr|. $$

Odd steps t have row sums matching their target marginals, whence the L 1-error f(A(t)) is equal to the (second) column-error sum. For even steps t, the (first) row-error sum is decisive.

The L 1-error coincides with the L 1-distance between a scaled weight matrix and its successor. To see this for an even step t, we substitute r i =a i+(t)/α i (t+1) from (IPF1) and a ij (t)/α i (t+1)=a ij (t+1) from (IPF2) to obtain

$$ f \bigl(A(t) \bigr) = \sum_{i \le k} \biggl| 1 -{1 \over \alpha_i(t+1) } \biggr| a_{i+}(t) = \sum _{i \le k} \sum_{j \le \ell} \bigl|a_{ij}(t) - a_{ij}(t+1) \bigr|. $$

Definitions (IPF3) and (IPF4) confirm the result for odd steps t. Deming and Stephan (1940, p. 440) recommend that the IPF procedure be continued until the table reproduces itself. This is exactly what is captured by the error function f: The table reproduces itself, A(t)=A(t+1), if and only if the L 1-error is zero, f(A(t))=0.
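
In code, this stopping rule reads as follows; the sketch continues the notation of the previous listing, and the tolerance is an implementation choice of ours, not part of the paper.

```python
import numpy as np

def f(X, r, s):
    """L1-error: absolute deviations of row sums from r plus those of column sums from s."""
    return np.abs(X.sum(axis=1) - r).sum() + np.abs(X.sum(axis=0) - s).sum()

def ipf_until_self_reproducing(A, r, s, tol=1e-12, max_pairs=100_000):
    """Iterate until the table (numerically) reproduces itself, i.e. until f vanishes."""
    X = np.asarray(A, dtype=float)
    X = X / (X.sum(axis=0) / s)                  # step 0: fit columns
    for _ in range(max_pairs):
        if f(X, r, s) <= tol:                    # equivalently: L1-distance to successor is ~0
            break
        X = X / (X.sum(axis=1) / r)[:, None]     # odd step: fit rows
        X = X / (X.sum(axis=0) / s)[None, :]     # even step: fit columns
    return X
```

Note that if no biproportional fit exists the L 1-error tends to a positive limit (Lemma 2 below), so the loop bound max_pairs guards against non-termination.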

Lemma 2 shows that the L 1-error is non-increasing in t, admits a finite set of lower bounds, and converges to the largest of these bounds as t→∞. The new limit formula in part (c) greatly simplifies the proof of the Convergence Theorem 3 below.

Lemma 2

(L 1-error)

Let A(t), t≥0, be the IPF sequence for the fitting of the weight matrix A to the target marginals r and s. Then we have for all steps t≥0:

  (a) The L 1-error is non-increasing, f(A(t))≥f(A(t+1)).

  (b) The L 1-error is bounded according to \(f (A(t) ) \ge r_{I} - s_{J_{A}(I)} + s_{J_{A}(I)'} - r_{I'}\), for all row subsets I⊆{1,…,k}.

  (c) The limiting L 1-error is given by

    $$ \lim_{t \to \infty} f \bigl(A(t) \bigr) = \max_{I \subseteq \{ 1,\ldots,k \} } ( r_I - s_{J_A(I)} + s_{J_A(I)'} - r_{I'} ). $$

Proof

(a) Let step t≥0 be even, whence A(t) has fitted columns and its L 1-error originates from rows. From (IPF1) and a i+(t)=α i (t+1)r i we get f(A(t))=∑ i≤k |a i+(t)−r i |=∑ i≤k |α i (t+1)−1|r i . Insertion of r i =∑ j≤ℓ a ij (t+1) allows us to apply the triangle inequality within each column j,

$$ \sum_{j \le \ell} \sum_{i \le k} \bigl|1- \alpha_i(t+1) \bigr| a_{ij}(t+1) \ge \sum _{j \le \ell} \biggl| \sum_{i \le k} \bigl( 1 - \alpha_i(t+1) \bigr) a_{ij}(t+1) \biggr|. $$

From (IPF2) and ∑ i≤k α i (t+1)a ij (t+1)=a +j (t)=s j we obtain ∑ j≤ℓ |a +j (t+1)−s j |=f(A(t+1)). With Definitions (IPF3) and (IPF4) the argument carries over to odd steps t≥1. Thus monotonicity is established.

(b) Because of monotonicity we may assume step t to be even. Let I be a subset of rows. Since ∑ i∈I (a i+(t)−r i )+∑ i∈I′ (a i+(t)−r i )=s +−r +, the complement I′ satisfies ∑ i∈I′ (a i+(t)−r i )=s +−r ++r I −∑ i∈I a i+(t). Hence we obtain

$$ f \bigl(A(t) \bigr) \ge \sum_{i \in I} \bigl( r_i - a_{i+}(t) \bigr) + \sum_{i \in I'} \bigl( a_{i+}(t) - r_i \bigr) = s_+ - r_+ + 2r_I - 2 \sum_{i \in I} a_{i+}(t). $$

From \(\sum_{i \in I} a_{i+}(t) = \sum_{i \in I} \sum_{j \in J_{A}(I)} a_{ij}(t) \le \sum_{j \in J_{A}(I)} \sum_{i \le k} a_{ij}(t) = s_{J_{A}(I)}\) we get

$$ s_+ - r_+ + 2r_I - 2 \sum_{i \in I} a_{i+}(t) \ge s_+ - r_+ + 2 r_I - 2 s_{J_A(I)} = r_I - s_{J_A(I)} + s_{J_A(I)'} - r_{I'}. $$

This establishes the lower bounds in (b).

(c) Theorem 5.3 of Gietl and Reffel (2013) states that the IPF even-step subsequence is convergent, lim t=0,2,4,… A(t)=B say. In view of part (a) every accumulation point minimizes the L 1-error, whence lim t→∞ f(A(t))=lim t=0,2,4,… f(A(t))=f(B). We claim that U={i≤k ∣ b i+<r i }, the set of underfitted rows in B, fulfills

$$ f(B) = r_U - s_{J_A(U)} + s_ {J_A(U)'} - r_{U'}. $$
(1)

This and the boundedness part (b) establish the limit formula in part (c). If the set U is empty, U=∅, then f(B)=s +−r + verifies (1). If the set U is exhaustive, U={1,…,k}, then f(B)=r +−s + verifies (1). Generally (1) is verified in four steps.

The first step claims that all row sums of B are positive, b i+>0 for all i≤k. Indeed, Definition (IPF3) implies a uniform upper bound for the column divisors, β j (t+2)=a +j (t+1)/s j ≤a ++(t+1)/s min=r +/s min, where s min=min{s 1,…,s ℓ }. Now Definition (IPF4) yields a lower bound,

$$ a_{i+}(t+2) = \sum_{j \le \ell} {a_{ij}(t+1) \over \beta_j(t+2)} \ge {s_{\min} \over r_+} a_{i+}(t+1) = {s_{\min} \over r_+} r_i. $$

Thus the limiting row sums remain positive, b i+≥s min r i /r +>0, as claimed.

Hence the accumulation point B is a weight matrix itself. Let γ i =b i+/r i >0 denote the row divisor to scale row i of B so as to meet the target marginal r i . Upon defining the matrix B(1) with entries b ij (1)=b ij /γ i , let δ j =b +j (1)/s j denote the column divisor to fit column j to the target marginal s j .

The second step establishes a relation between the divisors γ i and δ j ,

$$ b_{ij} > 0 \quad \Rightarrow \quad \gamma_i \delta_j = 1 \qquad \mbox{for all}\ i \le k\ \mathrm{and}\ j \le \ell. $$
(2)

To see this we note that γ i is the limit of the incremental row divisors α i (t+1),

$$ \gamma_i = {b_{i+} \over r_i} = \lim_{t = 0,2,4,\ldots} {a_{i+}(t) \over r_i} = \lim_{t = 0,2,4,\ldots} \alpha_i(t+1), $$
(3)

as follows from (IPF1). Now (3), (IPF2), and (IPF3) yield

$$ \delta_j = {b_{+j}(1) \over s_j} = \sum _{i \le k} {b_{ij} \over \gamma_i s_j} = \lim_{t = 0,2,4,\ldots} \sum_{i \le k} {a_{ij}(t) \over \alpha_i(t+1) s_j} = \lim _{t = 0,2,4,\ldots} \beta_j(t+2). $$
(4)

From (IPF2) and (IPF4) we get a ij (t)=α i (t+1)a ij (t+1)=α i (t+1)β j (t+2)a ij (t+2). Hence the L 1-distance between the scaled weight matrices A(t) and A(t+2) is

$$ \sum_{i \le k} \sum_ {j \le \ell} \bigl| a_{ij}(t) - a_{ij}(t+2) \bigr| = \sum_{i \le k} \sum_ {j \le \ell} \bigl|\alpha_i(t+1) \beta_j(t+2) - 1 \bigr| a_{ij}(t+2). $$

Convergence of the IPF even-step subsequence forces the L 1-distance to vanish in the limit. Together with (3) and (4) we obtain 0=∑ i≤k ∑ j≤ℓ |γ i δ j −1|b ij , that is, (2).

The third step depicts the structure of B as follows:

$$ B\ =\ \begin{matrix} &\begin{matrix}J_B(U)&J_B(U)'\\ \end{matrix}&\\ \begin{matrix}U\\ U'\\ \end{matrix}& \left(\begin{matrix}B^{(1)}&0\\ 0&B^{(2)}\\ \end{matrix}\right) &\begin{matrix}\gamma_i < 1\\ \gamma_i \ge 1\\ \end{matrix}\\ &\begin{matrix}\delta_j > 1&\delta_j \le 1\\ \end{matrix}&\\ \end{matrix} $$

Indeed, the top right block is zero because J B (U) contains all columns connected in B with U. Being underfitted, rows i∈U satisfy γ i <1. Complementary rows i∈U′ fulfill γ i ≥1. Columns j∈J B (U) are connected in B with U and have δ j >1, by (2). Columns j∈J B (U)′⊆J B (U′), connected in B with U′, have δ j ≤1, also by (2). Finally the bottom left block is seen to be zero, since i∈U′ and j∈J B (U) have γ i δ j >1 while b ij >0 would imply γ i δ j =1, by (2). Now the structure of B entails \(\sum_{i \in U} b_{i+} = s_{J_{B}(U)}\) and \(\sum_{i \in U'} b_{i+} = s_{J_{B}(U)'}\). The L 1-error of B, depending on rows only, turns into

$$ f(B) = \sum_{i \in U} ( r_i - b_{i+} ) + \sum_{i \in U'} ( b_{i+} - r_i ) = r_U - s_{J_B(U)} + s_{J_B(U)'} - r_{U'}. $$
(5)

The fourth step establishes J B (U)=J A (U). The inclusion J B (U)⊆J A (U) holds true since b ij >0 implies a ij >0. As for the converse inclusion we note that the top right block U×J B (U)′ has γ i δ j <1. Since the products α i (t+1)β j (t+2) converge to γ i δ j <1, by (3) and (4), the cumulative divisors tend to zero, lim t→∞ ρ i (t)σ j (t)=0. Now lim t→∞ a ij /(ρ i (t)σ j (t))=b ij =0 forces a ij =0. But a ij =0 on U×J B (U)′ implies J A (U)⊆J B (U). This proves J A (U)=J B (U), and turns (5) into (1). □
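
The limit formula can be checked numerically on small instances. The sketch below (our own code, brute force over all 2^k row subsets, for illustration only) evaluates the right-hand side of part (c); comparing it with f(A(t)) after many IPF steps, using the sketches from earlier in this section, reproduces the limit.

```python
import numpy as np
from itertools import combinations

def limiting_error(A, r, s):
    """Right-hand side of Lemma 2(c): the maximum over all row subsets I of
    r_I - s_{J_A(I)} + s_{J_A(I)'} - r_{I'}.  Expects numpy arrays r, s."""
    k, l = A.shape
    best = -np.inf
    for q in range(k + 1):                       # all subset sizes, including I = {} and I = all
        for I in combinations(range(k), q):
            JA = {j for j in range(l) if any(A[i, j] > 0 for i in I)}   # J_A(I)
            rI = sum(r[i] for i in I)                                   # r_I
            sJ = sum(s[j] for j in JA)                                  # s_{J_A(I)}
            best = max(best, rI - sJ + (s.sum() - sJ) - (r.sum() - rI))
    return best
```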

Lemma 2 allows a remarkably brief proof of the well-known necessary and sufficient conditions for the convergence of the IPF procedure. The arrangement of conditions runs parallel to the characterizations of directness in Theorem 2.

Theorem 3

(Convergence)

For the fitting of a weight matrix A to the target marginals r and s the following five statements are equivalent:

  (a) The IPF sequence A(t), t≥0, is convergent.

  (b) The biproportional fit of A to r and s exists.

  (c) There exists a weight matrix D preserving the zeros of A and matching the target marginals r and s.

  (d) Marginal totals are equal, r +=s +, and marginal partial row and column sums fulfill \(r_{I} \le s_{J_{A}(I)}\), for every row subset I⊆{1,…,k}.

  (e) The L 1-errors of the IPF sequence A(t), t≥0, tend to zero, lim t→∞ f(A(t))=0.

Proof

(a) ⇒ (b). If the IPF sequence converges then its limit B is a biproportional scaling of A. It inherits matching row sums along odd steps and matching column sums along even steps. By Theorem 1, B is the unique biproportional fit.

(b) ⇒ (c). The biproportional fit clearly qualifies as a matrix D of the type asked for in (c).

(c) ⇒ (d). As in the proof of Theorem 2, the definition of J A (I) yields \(d_{I \times J_{A}(I)'} = 0 \le d_{I' \times J_{A}(I)}\), and \(r_{I} = d_{I \times J_{A}(I)} + d_{I \times J_{A}(I)'} \le d_{I \times J_{A}(I)} + d_{I' \times J_{A}(I)} = s_{J_{A}(I)}\).

(d) ⇒ (e). Equal marginal totals entail \(r_{I} - s_{J_{A}(I)} + s_{J_{A}(I)'} - r_{I'} = 2(r_{I} - s_{J_{A}(I)})\). With Lemma 2(c) we get \(0 \le \lim_{t \to \infty} f ( A(t) ) = 2 \max_{I \subseteq \{ 1,\ldots,k \} } (r_{I} - s_{J_{A}(I)}) \le 0\).

(e) ⇒ (a). Since the entries of A(t) are nonnegative and sum to r + in odd steps and to s + in even steps, the IPF sequence A(t) stays in the compact set [0,max{r +,s +}] k×ℓ . Let B be an accumulation point along a subsequence A(t n ), n≥1. From (e) we get f(B)=lim n→∞ f(A(t n ))=lim t→∞ f(A(t))=0. With row and column sums fitted, B is a biproportional fit. By Theorem 1 there exists but one. Hence the IPF sequence A(t), t≥0, has B for its unique accumulation point, and converges. □

If all weights are positive, a ij >0, and marginal totals coincide, r +=s +, life simplifies dramatically. Statement (c) is fulfilled by the matrix D with entries d ij =r i s j /r +>0. Statement (d) holds true since non-empty row subsets I satisfy J A (I)={1,…,ℓ} and hence \(r_{I} \le s_{J_{A}(I)} = r_{+}\). Either way the IPF sequence is seen to converge (Theorem 3), and the resulting biproportional fit transpires to be direct (Theorem 2).

The results disclose their power if some weight is zero. In the general case it is not easy to decide whether the IPF sequence converges or not. However, the equivalence of statements (c) and (d) is known as the Feasible Distribution Theorem of network theory, see Gale (1957, p. 1075) or Rockafellar (1984, p. 69). It is well-known that statement (d) may be quickly checked by means of the Max-flow Min-cut Theorem.
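
For small k, statement (d) can also be verified by plain enumeration. The sketch below does so; the subset enumeration is exponential, so this is for illustration only, whereas the max-flow check referred to above scales polynomially.

```python
import numpy as np
from itertools import combinations

def condition_d(A, r, s, tol=1e-12):
    """Check Theorem 3(d): r_+ = s_+ and r_I <= s_{J_A(I)} for every row subset I.
    Expects numpy arrays r, s; tol absorbs floating-point noise."""
    k, l = A.shape
    if abs(r.sum() - s.sum()) > tol:             # equal marginal totals
        return False
    for q in range(1, k + 1):                    # every non-empty row subset I
        for I in combinations(range(k), q):
            JA = {j for j in range(l) if any(A[i, j] > 0 for i in I)}   # J_A(I)
            if sum(r[i] for i in I) > sum(s[j] for j in JA) + tol:
                return False
    return True
```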