1 Introduction

Belief propagation (BP) is a message-passing algorithm used to marginalize distributions over the variables contained in the nodes of a graph. BP operates by sending messages between nodes which are linked in the graph, and these messages are updated iteratively. At initialization, each node receives a set of random variables and an associated distribution function (often called the potential of the node). At any stage of BP we can instruct nodes to collect all incoming messages; these messages are then processed and used by the nodes to update their potentials. From this point onwards, we refer to these updated potentials as posterior distributions. A message from a node i to a neighbour j is updated by node i collecting all incoming messages except the message from node j, computing the posterior distribution and then processing this posterior distribution as a message to node j. The way posterior distributions are converted to messages depends on the type of graph and the goal of the propagation procedure (usually marginalization or determining the mode).

When applied to loopy graphs, BP may not converge and, if it does converge, it does not necessarily converge to the correct marginals. Even though BP does not necessarily supply the true marginal distributions, it is still useful as an approximate inference tool due to its computational tractability. In this article, we propose a regularized message-passing scheme on general Markov graphs (MGs). The goal is to improve the performance of message passing in loopy graphs. The performance of a message-passing scheme in the BP context can be measured by whether or not it converges, the rate at which it converges and the accuracy of the converged posterior distributions as approximations to the true marginals. We illustrate how regularized message passing can be used to address all three of these issues by implementing the scheme on a Gaussian MG under the canonical parametrization. This type of belief propagation is often referred to as Gaussian belief propagation (GaBP). For GaBP involving synchronous message passing, each iteration can be performed in at most \({\mathcal {O}}(p^{2})\) computations, assuming p variables distributed among p nodes. The number of computations can be substantially lower than \({\mathcal {O}}(p^{2})\) depending on the degree of sparsity in the precision matrix, and further acceleration can be obtained through distributive computing.

A by-product of this inference algorithm is that it implicitly solves the linear system \({\mathbf {S}}{\varvec{\mu }} = {\varvec{b}}\), where \({\mathbf {S}}: p \times p\) is the precision matrix and \({\varvec{b}}\) the potential vector of a multivariate Gaussian in canonical form. In the literature (Bickson 2008), GaBP has been compared favourably to the Jacobi and Gauss-Seidel methods as a solver of large and sparse linear systems, but faces tough competition from other methods such as the conjugate gradient (CG) and preconditioned conjugate gradient (PCG) algorithms. GaBP does not converge for all positive definite precision matrices and, in the case of convergence, the posterior precisions are not necessarily equal to the marginal precisions (Malioutov et al. 2006). However, if GaBP converges, the posterior means are equal to the marginal means (Weiss and Freeman 2001). We label the application of our regularized message passing on a Gaussian MG in canonical form slow Gaussian belief propagation (sGaBP). We show that sGaBP will converge given sufficient regularization and that it provides posterior means equal to the marginal means. This article includes empirical comparisons with other GaBP variants as well as with the CG solver. The results indicate that sGaBP converges faster than these variants of GaBP and provides more accurate approximations to the true marginals. Our simulations show that sGaBP can be comparable to CG (which provides no precision estimates); we make some suggestions on how to further accelerate sGaBP and comment on the use of preconditioning for both methods. In our concluding remarks, we discuss regularized message passing in a broader class of optimization and marginalization problems (beyond the Gaussian context).

2 Literature review

In the context of error-correcting codes, the roots of BP can be traced back to the development of the sum-product algorithm as a decoding algorithm for LDPC codes (Gallager 1963). Belief propagation (probability propagation) for Bayesian networks was introduced by Pearl (1988), Shachter (1988), Shafer and Shenoy (1990) and Lauritzen and Spiegelhalter (1988), and was later found to be equivalent to the sum-product algorithm (Aji and McEliece 2000; Frey and Kschischang 1996). BP is known to provide exact inference on tree-structured graphs but may fail to converge, or may converge to incorrect marginals, in the case of loopy graphs (Pearl 1988; Weiss 2000). However, BP can still be a useful tool as an approximate inference algorithm on loopy structures (Weiss 2000). Early work on GaBP in loopy graphs can be found in Weiss and Freeman (2001). Important contributions made here include:

  1.

    If GaBP converges, the posterior node potentials contain the correct marginal means.

  2.

    An interesting representation of the computations in loopy GaBP as GaBP applied on a tree-structured precision matrix (known as unwrapped or computation trees).

  3.

    A precision matrix \({\mathbf {S}}: p \times p\) is called diagonally dominant if \(S_{ii} > \sum _{j \ne i}|S_{ji}|\) for \(i = 1,2,\ldots ,p\). Diagonal dominance of a precision matrix is a sufficient condition for the convergence of GaBP.

The spectral radius of a matrix, \({\mathbf {S}}: p \times p\), with eigenvalues \(\tau _{i}: i = 1,2,\ldots ,p\) is defined to be,

$$\begin{aligned} \rho ({\mathbf {S}}) = \mathop {\text {max}}_{i} \{ |\tau _{i}| \}. \end{aligned}$$
(1)

Suppose \({\mathbf {S}}\) is symmetric, positive definite and normalized to have only ones along its diagonal, that is \({\mathbf {S}} = {\mathbf {I}} - {\mathbf {R}}\) where \(\text {diag}({\mathbf {R}}) = {\mathbf {0}}\). Let \(|{\mathbf {R}}|\) be the matrix with entries equal to the absolute values of the entries of \({\mathbf {R}}\). The matrix \({\mathbf {S}}\) is walk-summable if and only if the spectral radius of \(|{\mathbf {R}}|\) is less than one. The class of precision matrices for which GaBP is guaranteed to converge was expanded to include symmetric positive definite matrices which are walk-summable; GaBP may still converge for other precision matrices (Malioutov et al. 2006). In general, the converged posterior distributions do not give the exact marginal precisions. In the walk-summable case, this is because the computation trees do not cover all the walks present in the expansion \({\mathbf {S}}^{-1} = ({\mathbf {I}}-{\mathbf {R}})^{-1} = \sum _{k=0}^{\infty }{\mathbf {R}}^{k}\) (Malioutov et al. 2006). The posterior precisions can still be useful approximations of the marginal precisions. Several variants of GaBP have been proposed in the literature (Johnson et al. 2009; El-Kurdi et al. 2012a) to improve on the convergence performance of the original GaBP. These methods are aimed at accelerating GaBP and/or achieving convergence in cases where ordinary GaBP diverges. These variants emphasize the distributive implementation of GaBP for solving large systems of linear equations. A variety of sufficient conditions for the convergence of GaBP has been proposed in the literature (Weiss and Freeman 2001; Malioutov et al. 2006). More recent work provides necessary and sufficient conditions for the convergence of synchronous GaBP under a specified initialization set (Su and Wu 2015). Furthermore, necessary and sufficient conditions are established for damped synchronous GaBP, including the allowable range for the damping factor. A further contribution is the theoretical confirmation that damping can improve the convergence behaviour of GaBP. Applications of GaBP include MMSE multi-user detection, equalization and channel estimation in communication systems (Montanari et al. 2006; Guo and Huang 2011; Guo and Li 2008), fast solvers for systems of linear equations (El-Kurdi et al. 2012b; Shental et al. 2008), sparse Bayesian learning in large-scale compressed sensing problems (Seeger and Wipf 2010), and estimation on Gaussian graphical models (Chandrasekaran et al. 2008; Liu et al. 2012).
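To make the walk-summability condition concrete, the following is a minimal NumPy sketch that checks whether \(\rho (|{\mathbf {R}}|) < 1\) for a precision matrix already normalized to a unit diagonal; the function name and the toy matrix are illustrative choices, not part of any algorithm in this article.

```python
import numpy as np

def is_walk_summable(S, tol=1e-12):
    """Check walk-summability of a unit-diagonal precision matrix S.

    With S = I - R and diag(R) = 0, S is walk-summable iff the spectral
    radius of |R| (entry-wise absolute values) is strictly below one.
    """
    R = np.eye(S.shape[0]) - S                       # off-diagonal part of S, sign flipped
    rho = np.max(np.abs(np.linalg.eigvals(np.abs(R))))
    return rho < 1.0 - tol, rho

# toy example: a 3x3 normalized precision matrix
S = np.array([[1.0, 0.3, -0.2],
              [0.3, 1.0, 0.4],
              [-0.2, 0.4, 1.0]])
print(is_walk_summable(S))
```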

3 Message update rules

Before turning to our high-level approach, we make some comments on message update rules within the BP context. Bickson (2008) describes two conventional types of message update rules. In synchronous message passing, new messages are formed using messages from the previous round only and are therefore not influenced by the message scheduling. This is in contrast to the asynchronous case, where messages updated in the current round are used to compute new messages. Although asynchronous updates tend to outperform the synchronous approach (Koller and Friedman 2009), our main focus will be on the synchronous case. We do this since one of the more attractive properties of GaBP is its applicability in distributive settings, which is far more compatible with synchronous message updates. Synchronous implementation also allows us to compare different GaBP algorithms without considering the effects of different message schedules. We do, however, include a section with comments on asynchronous message updates.

4 High-level approach

Our high-level approach is based on the max-product belief propagation algorithm. We will restrict our discussion to synchronous message updates. Suppose we want to find the mode of a density function \(f({\mathbf {x}})\) with the expansion,

$$\begin{aligned} f({\mathbf {x}}) = e^{K}\prod _{i=1}^{p}\delta _{i}({\mathbf {x}}_{i}) \times \prod _{i\ne j}^{p}g_{ij}({\mathbf {x}}_{i},{\mathbf {x}}_{j}), \end{aligned}$$
(2)

where \({\mathbf {x}} = ({\mathbf {x}}_{1},{\mathbf {x}}_{2},\ldots ,{\mathbf {x}}_{p})\), \({\mathbf {x}}_{i}\) may be higher dimensional and \(e^{K}\) is a normalization constant. The max-product algorithm operates on

$$\begin{aligned} l({\mathbf {x}}) = \text {log}(f({\mathbf {x}})) =K+ \sum _{i=1}^{p}\phi _{i}({\mathbf {x}}_{i}) + \sum _{i\ne j}^{p}h_{ij}({\mathbf {x}}_{i},{\mathbf {x}}_{j}), \end{aligned}$$
(3)

where \(\phi _{i} = \text {log}(\delta _{i})\) and \(h_{ij} = \text {log}(g_{ij})\). Equations 2 and 3 correspond to a Markov graph with p nodes \(\mathcal {H}_{i}: i = 1,2,\ldots ,p\). We assign to node \(\mathcal {H}_{i}\) the vector \({\mathbf {x}}_{i}\). Node i and node j are linked if \(h_{ij}({\mathbf {x}}_{i},{\mathbf {x}}_{j})\) is not zero for certain \({\mathbf {x}}_{i}\), \({\mathbf {x}}_{j}\). Let \({\mathcal {N}}_{i}\) be the set containing the neighbours of node i (we do not include i in this set), that is all nodes to which i has a link. Suppose we are at stage n of a synchronous max-product belief propagation algorithm with messages \(m_{ij}^{(n)}(.)\) for \(i = 1,2,\ldots ,p\) and \(j \in {\mathcal {N}}_{i}\). The updated messages are

$$\begin{aligned} m_{ij}^{(n+1)}({\mathbf {x}}_{j}) = \mathop {\text {max}}_{{\mathbf {x}}_{i}} \bigg \{ \phi _{i}({\mathbf {x}}_{i}) + h_{ij}({\mathbf {x}}_{i},{\mathbf {x}}_{j}) + \sum _{k \in {\mathcal {N}}_{i}/j}m_{ki}^{(n)}({\mathbf {x}}_{i}) \bigg \}, \end{aligned}$$
(4)

where by \({\mathcal {N}}_{i}/j\) we mean the set \({\mathcal {N}}_{i}\) with j removed. Suppose at stage \(n-1\), the posterior mode is \({\varvec{\mu }}^{(n-1)} = ({\varvec{\mu }}_{1}^{(n-1)},{\varvec{\mu }}_{2}^{(n-1)},\ldots ,{\varvec{\mu }}_{p}^{(n-1)})'\). By node regularization, we mean that the optimization problem in Eq. 4 is replaced by,

$$\begin{aligned} m_{ij}^{(n+1)}({\mathbf {x}}_{j};\lambda )&= \mathop {\text {max}}_{{\mathbf {x}}_{i}} \bigg \{ \phi _{i}({\mathbf {x}}_{i}) + h_{ij}({\mathbf {x}}_{i},{\mathbf {x}}_{j}) \nonumber \\&\quad +\sum _{k \in {\mathcal {N}}_{i}/j}m_{ki}^{(n)}({\mathbf {x}}_{i})\nonumber \\&\quad -\frac{\lambda }{2} \bigg [\parallel {\mathbf {x}}_{i} -{\varvec{\mu }}_{i}^{(n-1)} \parallel _{q}\bigg ]^{q} \bigg \}, \end{aligned}$$
(5)

with \(\lambda \) a scalar and \(\parallel .\parallel _{q}\) the \(L_{q}\) norm of a vector. The use of \({\varvec{\mu }}^{(n-1)}\) instead of \({\varvec{\mu }}^{(n)}\) when propagating messages at stage n (that is, when computing the stage \(n+1\) messages) is important. To understand the rationale behind node regularization, one needs to understand one of the fundamental problems in belief propagation, namely the problem of loopy graphs. Consider a node i in a Markov graph and suppose there is at least one path \(P_{i}\) through the graph back to node i. Sending messages through this path means that the prior potential of node i is cycled back to this node. This causes node i to continually increase the belief in its prior mode, which may result in either divergence or incorrect convergence. The idea behind the penalty in (5) is to slow down this increase in belief, hence the name slow Gaussian belief propagation. Although the formulation in (5) seems to make sense for \(\lambda \ge 0\), there is also a role for negative values. Negative values of \(\lambda \) correspond to a relaxation effect on the message propagation, where nodes are encouraged to favour their own prior beliefs. In the context of GaBP, negative values of \(\lambda \) relate sGaBP to RGaBP (relaxed GaBP). In the empirical section, we provide an indication of when it is appropriate to use a negative \(\lambda \) and compare RGaBP and sGaBP in this context.

5 Slow Gaussian belief propagation message updates

We use (5) to derive the message updates in the Gaussian context. It is natural to use \(q = 2\) in (5) since this preserves the conjugacy of the messages. For the Gaussian distribution, we have

$$\begin{aligned} \phi _{i}({\mathbf {x}}_{i})&= -\frac{1}{2}{\mathbf {x}}_{i}'{\mathbf {S}}_{ii}{\mathbf {x}}_{i} + {\mathbf {x}}_{i}'{\varvec{b}}_{i}, \nonumber \\ h_{ij}({\mathbf {x}}_{i},{\mathbf {x}}_{j})&= -{\mathbf {x}}_{i}'{\mathbf {S}}_{ij}{\mathbf {x}}_{j}. \end{aligned}$$
(6)

We assume that the messages are of the form,

$$\begin{aligned} m_{ij}^{(n)}({\mathbf {x}}_{j}) = -\frac{1}{2}{\mathbf {x}}_{j}'{\mathbf {Q}}_{ij}^{(n)}{\mathbf {x}}_{j} + {\mathbf {x}}_{j}'\mathbf {v}^{(n)}_{ij} + C^{(n)}_{ij}, \end{aligned}$$
(7)

for all \(i \ne j\) and \(C_{ij}^{(n)}\) is a constant. Applying (7) to (5) we obtain,

$$\begin{aligned} m_{ij}^{(n+1)}({\mathbf {x}}_{j};\lambda )&= \mathop {\text {max}}_{{\mathbf {x}}_{i}} \bigg \{ -\frac{1}{2}{\mathbf {x}}_{i}' ({\mathbf {S}}_{ii}+ \sum _{k \in {\mathcal {N}}_{i}/j}{\mathbf {Q}}_{ki}^{(n)}){\mathbf {x}}_{i} \nonumber \\&\quad + {\mathbf {x}}_{i}'({\varvec{b}}_{i}-{\mathbf {S}}_{ij}{\mathbf {x}}_{j}+ \sum _{k \in {\mathcal {N}}_{i}/j}\mathbf {v}_{ki}^{(n)}) \nonumber \\&\quad - \frac{\lambda }{2}({\mathbf {x}}_{i} -{\varvec{\mu }}_{i}^{(n-1)})'({\mathbf {x}}_{i} -{\varvec{\mu }}_{i}^{(n-1)}) \nonumber \\&\quad + \sum _{k \in {\mathcal {N}}_{i}/j}C_{ki}^{(n)} \bigg \}. \end{aligned}$$
(8)

Since \(\frac{\lambda }{2}({\mathbf {x}}_{i} -{\varvec{\mu }}_{i}^{(n-1)})'({\mathbf {x}}_{i} -{\varvec{\mu }}_{i}^{(n-1)}) = \frac{\lambda }{2}{\mathbf {x}}_{i}'{\mathbf {x}}_{i} - \lambda {\mathbf {x}}_{i}'{\varvec{\mu }}_{i}^{(n-1)} + \frac{\lambda }{2}||{\varvec{\mu }}_{i}^{(n-1)}||^{2} \),

$$\begin{aligned} m_{ij}^{(n+1)}({\mathbf {x}}_{j};\lambda )&= \mathop {\text {max}}_{{\mathbf {x}}_{i}} \bigg \{ -\frac{1}{2}{\mathbf {x}}_{i}' (\lambda {\mathbf {I}} + {\mathbf {S}}_{ii} + \sum _{k \in {\mathcal {N}}_{i}/j}{\mathbf {Q}}_{ki}^{(n)}){\mathbf {x}}_{i} \nonumber \\&\quad + {\mathbf {x}}_{i}'({\varvec{b}}_{i} + \lambda {\varvec{\mu }}_{i}^{(n-1)}-{\mathbf {S}}_{ij}{\mathbf {x}}_{j}\nonumber \\&\quad +\sum _{k \in {\mathcal {N}}_{i}/j}\mathbf {v}_{ki}^{(n)}) + \tilde{C}_{ij} \bigg \}, \end{aligned}$$
(9)

where \(\tilde{C}_{ij}\) is a constant. For convenience we set \({\mathbf {A}}_{ij}^{(n)} = \lambda {\mathbf {I}} + {\mathbf {S}}_{ii} + \sum _{k \in {\mathcal {N}}_{i}/j}{\mathbf {Q}}_{ki}^{(n)}\) and \({\mathbf {a}}_{ij}^{(n)} = {\varvec{b}}_{i} + \lambda {\varvec{\mu }}_{i}^{(n-1)}+ \sum _{k \in {\mathcal {N}}_{i}/j}\mathbf {v}_{ki}^{(n)}\). We now need to compute

$$\begin{aligned}&m_{ij}^{(n+1)}({\mathbf {x}}_{j};\lambda ) \nonumber \\&\quad =\mathop {\text {max}}_{{\mathbf {x}}_{i}} \bigg \{ -\frac{1}{2}{\mathbf {x}}_{i}' {\mathbf {A}}_{ij}^{(n)}{\mathbf {x}}_{i} + {\mathbf {x}}_{i}'({\mathbf {a}}_{ij}^{(n)}-{\mathbf {S}}_{ij}{\mathbf {x}}_{j}) + \tilde{C}_{ij} \bigg \}. \end{aligned}$$
(10)

It is easy to show that the maximum in (10) is attained at \({\mathbf {x}}_{i} = [{\mathbf {A}}_{ij}^{(n)}]^{-1}({\mathbf {a}}_{ij}^{(n)}-{\mathbf {S}}_{ij}{\mathbf {x}}_{j})\); substituting this value into \(-\frac{1}{2}{\mathbf {x}}_{i}' {\mathbf {A}}_{ij}^{(n)}{\mathbf {x}}_{i} + {\mathbf {x}}_{i}'({\mathbf {a}}_{ij}^{(n)}-{\mathbf {S}}_{ij}{\mathbf {x}}_{j}) + \tilde{C}_{ij}\) yields the updated message. We now make the assumption that \({\mathbf {Q}}_{ij}^{(n)}\) is symmetric for all \(i \ne j\). We have,

$$\begin{aligned} m_{ij}^{(n+1)}({\mathbf {x}}_{j};\lambda )&= \frac{1}{2}({\mathbf {a}}_{ij}^{(n)} -{\mathbf {S}}_{ij}{\mathbf {x}}_{j})'[{\mathbf {A}}_{ij}^{(n)}]^{-1}({\mathbf {a}}_{ij}^{(n)}-{\mathbf {S}}_{ij}{\mathbf {x}}_{j}) \nonumber \\&\quad \, + \tilde{C}_{ij} \nonumber \\&= \frac{1}{2}{\mathbf {x}}_{j}'{\mathbf {S}}_{ji}[{\mathbf {A}}_{ij}^{(n)}]^{-1}{\mathbf {S}}_{ij}{\mathbf {x}}_{j} \nonumber \\&\quad \, -{\mathbf {x}}_{j}'{\mathbf {S}}_{ji}[{\mathbf {A}}_{ij}^{(n)}]^{-1}{\mathbf {a}}_{ij}^{(n)} + C^{(n+1)}_{ij}\nonumber \\&= -\frac{1}{2}{\mathbf {x}}_{j}'{\mathbf {Q}}_{ij}^{(n+1)}{\mathbf {x}}_{j} + {\mathbf {x}}_{j}'\mathbf {v}^{(n+1)}_{ij} + C^{(n+1)}_{ij}, \end{aligned}$$
(11)

where \({\mathbf {Q}}_{ij}^{(n+1)} = -{\mathbf {S}}_{ji}[{\mathbf {A}}_{ij}^{(n)}]^{-1}{\mathbf {S}}_{ij}\), \(\mathbf {v}^{(n+1)}_{ij}= -{\mathbf {S}}_{ji}[{\mathbf {A}}_{ij}^{(n)}]^{-1}{\mathbf {a}}_{ij}^{(n)}\) and \(C^{(n+1)}_{ij}\) does not depend on \({\mathbf {x}}_{j}\). In the literature, it is common practice to ignore the update of the constant-components of messages since they are not needed to update \({\mathbf {Q}}_{ij}^{(n+1)}\) and \(\mathbf {v}^{(n+1)}_{ij}\) and we will follow this convention. We note that all the assumptions made in this section are recurring and can therefore be ensured by appropriate initialization. Our focus will be on one-dimensional nodes. Here, all the matrices are replaced by scalars and our message updates are performed by

$$\begin{aligned} Q_{ij}^{(n+1)}&= \frac{-S_{ij}^{2}}{\lambda + S_{ii} + \sum _{k \in {\mathcal {N}}_{i}/j}Q_{ki}^{(n)}}. \end{aligned}$$
(12)
$$\begin{aligned} V_{ij}^{(n+1)}&= \frac{Q_{ij}^{(n+1)}}{S_{ij}} \left[ \lambda \mu _{i}^{(n-1)} + b_{i} + \sum _{k \in {\mathcal {N}}_{i}/j}V_{ki}^{(n)} \right] . \end{aligned}$$
(13)

In order to ensure the convergence of sGaBP, while preserving the exactness of the posterior means, the implementation of the algorithm requires some additional steps (beyond the message passing). The implementation of these steps and the convergence properties of sGaBP are discussed in the next section.

6 Convergence analysis

The synchronous implementation of sGaBP is given in Algorithm 1, certain steps of which are discussed in this section. We start by discussing the computation of the posterior distributions after each iteration. These computations are important since they are sufficient to ensure convergence of sGaBP (for large enough \(\lambda \)) while preserving the posterior means as the exact marginal means. This is followed by a study of the convergence behaviour of the precision components of the messages, i.e. the behaviour of \({\mathbf {Q}}^{(n)}\) in Algorithm 1. We then proceed to the mean/potential components of the messages by assuming convergence of the precision components. In our convergence analysis, we assume that the precision matrix has been preconditioned to have ones on the diagonal. If the precision matrix is \({\mathbf {S}}\) (before preconditioning), this can be achieved by setting \({\mathbf {D}} = \text {diag}(\frac{1}{\sqrt{S_{11}}},\frac{1}{\sqrt{S_{22}}},\ldots ,\frac{1}{\sqrt{S_{pp}}})\) and computing \({\mathbf {DSD}}\). This type of preconditioning does not entail any loss of information in the sense that both the marginal means and the marginal precisions of the distribution in its original scale can be recovered.
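As an illustration of this preconditioning step, the NumPy sketch below rescales a system \(({\mathbf {S}}, {\varvec{b}})\) to a unit diagonal and records the scaling needed to map results back to the original scale; the helper name is an illustrative choice.

```python
import numpy as np

def normalize_precision(S, b):
    """Rescale (S, b) so the precision matrix has ones on its diagonal.

    If S_n mu_n = b_n in the normalized scale, then mu = d * mu_n in the
    original scale, and an original marginal precision equals the
    corresponding normalized marginal precision divided by d_i**2.
    """
    d = 1.0 / np.sqrt(np.diag(S))
    S_n = S * np.outer(d, d)          # D S D with D = diag(d)
    b_n = d * b
    return S_n, b_n, d
```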

6.1 Computation of posterior distributions

In order to ensure convergence of sGaBP and to have the posterior means (at convergence) equal to the correct marginal means, it is necessary to adjust the manner in which posterior distributions are computed. Consider the computation of the posterior distribution of node i at stage n. As a first step, we instruct node i to collect all incoming messages, which can be characterized by the parameters \(\sum _{t\ne i}Q_{ti}^{(n)}\) (precision components) and \(\sum _{t\ne i}V_{ti}^{(n)}\) (mean/potential components). We suggest keeping the posterior precisions as in normal belief propagation, that is \(1 + \sum _{t\ne i}Q_{ti}^{(n)}\). Later, we investigate the role of \(\lambda \) in the tuning of the posterior precisions to better approximate the marginal precisions. The posterior mean of node i at stage n is given by

$$\begin{aligned} \mu _{i}^{(n)} = \frac{\lambda \mu ^{(n-1)}_{i} + z_{i}^{(n)}}{\lambda + q_{i}^{(n)}} = \gamma _{i}^{(n)}\mu ^{(n-1)}_{i} + (1-\gamma _{i}^{(n)})\frac{z_{i}^{(n)}}{q_{i}^{(n)}}\qquad \end{aligned}$$
(14)

where \(z_{i}^{(n)} = b_{i} +\sum _{t\ne i}V_{ti}^{(n)} \), \(q_{i}^{(n)} = 1 + \sum _{t\ne i}Q_{ti}^{(n)}\) and \(\gamma _{i}^{(n)} = \frac{\lambda }{\lambda + q_{i}^{(n)}}\). Note that \(\frac{z_{i}^{(n)}}{q_{i}^{(n)}}\) is the posterior mean we would have computed if no adjustment was made to the computation of the posterior distribution. Hence, we can interpret (14) as damping between the posterior mean under normal belief propagation and the posterior mean computed in the previous round. An attractive feature is that these damping factors are computed automatically (using \(\lambda \) and the current posterior precisions) and no additional parameters are required. The values \(\gamma _{i}^{(n)}: i = 1,2,\ldots ,p\) can also act as relaxation factors, which correspond to negative \(\lambda \). We now show that these adjustments are sufficient for convergence and for the preservation of the (converged) posterior means as the exact marginal means.

Algorithm 1: Synchronous sGaBP
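The following is a minimal NumPy sketch of one possible synchronous implementation of the updates (12)–(14) for a unit-diagonal precision matrix with scalar nodes; the initialization (zero messages, previous means set to b) and the residual-based stopping rule are our own choices and need not match Algorithm 1 exactly.

```python
import numpy as np

def sgabp_sync(S, b, lam, max_iter=500, tol=1e-10):
    """Synchronous sGaBP sketch on a unit-diagonal precision matrix S.

    Implements the scalar message updates (12)-(13) and the damped
    posterior-mean update (14); returns posterior means and precisions.
    """
    p = len(b)
    Q = np.zeros((p, p))          # Q[i, j]: precision component of message i -> j
    V = np.zeros((p, p))          # V[i, j]: mean/potential component of message i -> j
    mu_old = b.copy()             # plays the role of mu^(n-1); an initialization choice
    nbrs = [np.flatnonzero((S[i] != 0) & (np.arange(p) != i)) for i in range(p)]

    for _ in range(max_iter):
        # posterior parameters at the current stage, eq. (14)
        q_post = 1.0 + Q.sum(axis=0)
        z = b + V.sum(axis=0)
        mu = (lam * mu_old + z) / (lam + q_post)

        # synchronous message updates, eqs. (12)-(13); S_ii = 1 after normalization
        Q_new, V_new = np.zeros_like(Q), np.zeros_like(V)
        for i in range(p):
            q_in = Q[:, i].sum()                      # all incoming precision components
            v_in = V[:, i].sum()                      # all incoming mean components
            for j in nbrs[i]:
                A = lam + 1.0 + q_in - Q[j, i]        # exclude the message j -> i
                a = lam * mu_old[i] + b[i] + v_in - V[j, i]
                Q_new[i, j] = -S[i, j] ** 2 / A
                V_new[i, j] = (Q_new[i, j] / S[i, j]) * a

        if np.max(np.abs(S @ mu - b)) < tol:          # residual-based stopping rule
            break
        Q, V, mu_old = Q_new, V_new, mu
    return mu, q_post
```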

6.2 The precision components

The convergence analysis of the precision components is the simpler of the two since we can apply results found in the literature (Malioutov et al. 2006; Bickson 2008). Suppose \({\mathbf {Q}}^{(n)}(\lambda )\) holds the precision components of the messages at iteration n. The analysis of \({\mathbf {Q}}^{(n)}(\lambda )\) is simple since it is identical to the precision components provided by ordinary GaBP applied on the matrix \(\lambda {\mathbf {I}} + {\mathbf {S}}\). Therefore, we only need to select \(\lambda \) large enough such that \(\lambda {\mathbf {I}} + {\mathbf {S}}\) is walk-summable (although smaller selections of \(\lambda \) can also suffice). From this point onwards, we use the symbol \(\tilde{\rho }({\mathbf {S}}) = \rho ({\mathbf {I}}-{\mathbf {S}})\), we also refer to \(\tilde{\rho }({\mathbf {S}})\) as the zero-diagonal spectral radius of \({\mathbf {S}}\). Selecting \(\lambda > \tilde{\rho }(|{\mathbf {S}}|)-1 = \rho (|{\mathbf {R}}|)-1\), where \({\mathbf {S}} = {\mathbf {I}}-{\mathbf {R}}\), is sufficient for the precision components in Algorithm 1 to converge. In addition to this, we prove Theorem 1 in “Appendix 1”.

Theorem 1

The following properties recur indefinitely (as a set) in sGaBP.

  1.

    \(Q_{ij}^{(n)} \le 0\) for all i, \(j \in {\mathcal {N}}_{i}\).

  2.

    \(|Q_{ij}^{(n)}| > |Q_{ij}^{(n-1)}|\) for all i, \(j \in {\mathcal {N}}_{i}\).

  3.

    \(\delta _{i}^{(n)} = \sum _{t \in {\mathcal {N}}_{i}}|Q_{ti}^{(n)}| \le \delta _{i}\) for a \(0 \le \delta _{i} < 1 + \lambda \) and all i.

  4.

    \(\sum _{t \in {\mathcal {N}}_{i}} \frac{S_{ti}^{2}}{1 + \lambda - \delta _{t} + |Q_{it}^{(n)}|} \le \delta _{i}\) for all i.

If these conditions hold, then \(Q_{ij}^{(n)}\) are monotone decreasing and bounded from below by \(\frac{-S_{ij}^{2}}{1 + \lambda - \delta _{i}}\) and will therefore converge. Consider the case where \(\delta _{i} = \delta \) for all i and suppose we want the conditions in Theorem 1 to hold from \(n=0\). Since \(Q_{ij}^{(0)} = 0\) we need a \(0 \le \delta < 1 + \lambda \) satisfying \(\sum _{i \ne j}\frac{S_{ij}^{2}}{1+ \lambda - \delta } \le \delta \) for all j. This inequality is equivalent to a quadratic inequality with roots,

$$\begin{aligned} \frac{(1+\lambda ) \pm \sqrt{(1+\lambda )^{2} - 4 \sum _{i \ne j}S_{ij}^{2}}}{2}. \end{aligned}$$
(15)

If we select \((1+\lambda )^{2} - 4 \times \text {max}_{j} \bigg \{ \sum _{i \ne j}S_{ij}^{2} \bigg \} \ge 0\) or \(\lambda \ge 2 \sqrt{\text {max}_{j} \bigg \{ \sum _{i \ne j}S_{ij}^{2} \bigg \}} - 1\), then we can select \(\delta = \frac{1+\lambda }{2}\) to guarantee monotone convergence of all the precisions. For this selection, the bounds on the precisions are

$$\begin{aligned} - \frac{2S_{ij}^{2}}{1+\lambda }\le Q_{ij}^{(n)} \le 0. \end{aligned}$$
(16)

An important consequence of (16) is that

$$\begin{aligned} \mathop {\text {lim}}_{\lambda \rightarrow \infty } Q_{ij}^{(n)} = 0, \end{aligned}$$
(17)

for all \(i \ne j\). Here, we emphasize the role of \(\lambda \) in the tuning of the posterior precisions. Note that we can tune the converged precisions, \(Q_{ij}\), to any value in the interval \([- \frac{2S_{ij}^{2}}{1+\lambda _{0}};0]\) (this interval can be much larger) where \(\lambda _{0} = 2 \sqrt{\text {max}_{j} \bigg \{ \sum _{i \ne j}S_{ij}^{2} \bigg \}} - 1\), although there is dependence among the \(Q_{ij}\)’s in terms of \(\lambda \). This can in turn be used to tune the posterior precisions, \(1+\sum _{t \ne i}Q_{ti}\), under certain restrictions. The tuning can be made more flexible by introducing multiple tuning parameters.
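The sufficient level of regularization used above, \(\lambda _{0} = 2 \sqrt{\text {max}_{j} \{ \sum _{i \ne j}S_{ij}^{2} \}} - 1\), is straightforward to compute; a small sketch (assuming a unit-diagonal \({\mathbf {S}}\), helper name illustrative) is:

```python
import numpy as np

def lambda_monotone(S):
    """Regularization sufficient for the monotone-convergence argument:
    lambda >= 2 * sqrt(max_j sum_{i != j} S_ij^2) - 1 (unit-diagonal S)."""
    off = S - np.diag(np.diag(S))
    return 2.0 * np.sqrt(np.max((off ** 2).sum(axis=0))) - 1.0
```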

Fig. 1

Plot of the spectral radius of the linear-update matrix as a function of \(\lambda \) for a simulated \(10 \times 10\) matrix with a zero-diagonal spectral radius equal to one. The red solid line corresponds to a complex spectral radius. The black broken line corresponds to a real spectral radius

6.3 The mean components

In the previous section, we saw that the precision components of the messages will converge for sufficiently large choices of \(\lambda \). In this section, we proceed under the assumption that the precision components have converged. We denote the converged precision message-components, posterior precisions and damping factors by \(Q_{ij}\), \(q_{i}\) and \(\gamma _{i}\), respectively. The updates of the mean components are

$$\begin{aligned} V_{ij}^{(n+1)} = \frac{Q_{ij}}{S_{ij}}\bigg [ \lambda \mu _{i}^{(n-1)} + b_{i} + \sum _{t \in {\mathcal {N}}_{i}/j}V_{ti}^{(n)} \bigg ]. \end{aligned}$$
(18)

We define \(\varvec{\theta }^{(n+1)}\) to be the vector obtained by stacking the columns of \({\mathbf {V}}^{(n+1)}\), removing the diagonal entries and appending \({\varvec{\mu }}^{(n)}\) (after the columns of \({\mathbf {V}}^{(n+1)}\)). This vector can be expressed as,

$$\begin{aligned} \varvec{\theta }^{(n+1)} = \varvec{\theta } + {\mathbf {L}}\varvec{\theta }^{(n)}, \end{aligned}$$
(19)

for a matrix \({\mathbf {L}}: p^{2} \times p^{2}\) and a vector of constants \(\varvec{\theta }: p^{2} \times 1\). The first \(p^{2} - p\) entries of \(\varvec{\theta }\) can be obtained by constructing the matrix \({\mathbf {C}} = [\frac{Q_{ij}}{S_{ij}} b_{i}]\), with the understanding that the diagonals are zero, and stacking its columns in the same way as we did with the mean components. The final p entries of \(\varvec{\theta }\) are \(\frac{1-\gamma _{i}}{q_{i}}b_{i}\) in order \(i = 1,2,\ldots ,p\). The construction of \({\mathbf {L}}\) is more complex. Consider,

$$\begin{aligned} {\mathbf {L}}: p^{2} \times p^{2} = \begin{bmatrix} {\mathbf {L}}_{11}: l \times l&{\mathbf {L}}_{12}: l \times p \\ {\mathbf {L}}_{21}: p \times l&{\mathbf {L}}_{22}: p \times p \end{bmatrix}, \end{aligned}$$
(20)

where \(l = p^{2} - p\). Consider one of the first l elements of \(\varvec{\theta }^{(n+1)}\), say m. This element corresponds to an entry in the matrix \({\mathbf {V}}^{(n+1)}\), say \(V_{ij}^{(n+1)}\). The next step is to identify the neighbours of i, that is the set \({\mathcal {N}}_{i}\). For each \(k \in {\mathcal {N}}_{i}/j\), we find the element in \(\varvec{\theta }^{(n)}\) corresponding to \(V_{ki}^{(n)}\) and note its position. The entry in row m of \({\mathbf {L}}\) in this position is \(\frac{Q_{ij}}{S_{ij}}\). This accounts for the matrix \({\mathbf {L}}_{11}\) with the understanding that all elements not accessed are zero. Continuing with this notation, the entry in row m of \({\mathbf {L}}_{12}\) in position i is \(\frac{\lambda Q_{ij}}{S_{ij}}\) and all other elements in this row are zero. We see that \({\mathbf {L}}_{22}\) is a diagonal matrix with entries \(\gamma _{i}\) in order \(i = 1,2,\ldots ,p\). Consider the matrix \({\mathbf {L}}_{21}\). The first step is to identify the neighbours of node i, that is \({\mathcal {N}}_{i}\). We then move along the vector \(\varvec{\theta }^{(n)}\) and identify all the positions corresponding to \(V_{ti}^{(n)}: t \in {\mathcal {N}}_{i}\). In row i of \({\mathbf {L}}_{21}\), we place the value \(\frac{1-\gamma _{i}}{q_{i}}\) in the identified positions, the rest of the entries are zero.
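The bookkeeping above can be condensed into a short routine that builds \({\mathbf {L}}\) from the converged precision components; the sketch below assumes scalar nodes, a unit-diagonal \({\mathbf {S}}\) and converged quantities Q and q, and the helper names are illustrative.

```python
import numpy as np

def build_update_matrix(S, Q, q, lam):
    """Build the linear-update matrix L of eqs. (19)-(20) for scalar nodes.

    Q[i, j] holds the converged precision component of the message i -> j,
    q the converged posterior precisions; gamma_i = lam / (lam + q_i).
    """
    p = S.shape[0]
    gamma = lam / (lam + q)
    # position of V_{i -> j} in the stacked vector: columns of V, diagonal removed
    pos, m = {}, 0
    for j in range(p):
        for i in range(p):
            if i != j:
                pos[(i, j)] = m
                m += 1
    l = p * p - p
    L = np.zeros((l + p, l + p))
    nbrs = [np.flatnonzero((S[i] != 0) & (np.arange(p) != i)) for i in range(p)]
    for (i, j), row in pos.items():
        if S[i, j] == 0.0:
            continue                                   # non-neighbours contribute zero rows
        coef = Q[i, j] / S[i, j]
        for k in nbrs[i]:
            if k != j:
                L[row, pos[(k, i)]] = coef             # block L_11
        L[row, l + i] = lam * coef                     # block L_12
    for i in range(p):
        for t in nbrs[i]:
            L[l + i, pos[(t, i)]] = (1.0 - gamma[i]) / q[i]   # block L_21
        L[l + i, l + i] = gamma[i]                            # block L_22
    return L

# the spectral radius governing convergence of the mean components:
# rho_L = np.max(np.abs(np.linalg.eigvals(L)))
```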

Our goal is to analyse the spectral radius of \({\mathbf {L}}\). We note that the eigenvalues of \({\mathbf {L}}\) can possibly be complex. In the case of complex eigenvalues, the spectral radius of \({\mathbf {L}}\) is defined to be the largest modulus among the eigenvalues of \({\mathbf {L}}\). If the spectral radius of \({\mathbf {L}}\) is less than 1, sGaBP will converge (assuming that the precisions converge). The value of the spectral radius has a heavy influence on the convergence speed of sGaBP and can play a role in deciding how to select \(\lambda \). A natural way to select the amount of regularization is to seek the \(\lambda \) for which the spectral radius (of \({\mathbf {L}}\)) is a minimum. We make some comments on the form of the spectrum later in this section. For the purpose of this article, we consider the asymptotic behaviour of the spectral radius and show that the spectral radius approaches 1 from below as \(\lambda \rightarrow \infty \). The selection of \(\lambda \) is considered in the section on heuristic measures. Theorem 2 provides information on the asymptotic behaviour of the spectrum; the proof is given in “Appendix 1”.

Theorem 2

Consider sGaBP applied to a multivariate Gaussian with potential \({\varvec{b}}\) and precision matrix \({\mathbf {S}}\). The eigenvalues of the linear-update matrix can be characterized as,

$$\begin{aligned}&1 - \frac{\sigma _{i}}{\lambda } + {\mathcal {O}}\left( \frac{1}{\lambda ^{2}}\right) ;\quad i = 1,2,\ldots ,p \end{aligned}$$
(21)
$$\begin{aligned}&\frac{\pm S_{ij}}{\lambda } + {\mathcal {O}}\left( \frac{1}{\lambda ^{2}}\right) ;\quad i \ne j, \end{aligned}$$
(22)

where \(\sigma _{i},i=1,2,\ldots ,p\) represent the eigenvalues of \({\mathbf {S}}\).

In particular, we see that if \({\mathbf {S}}\) is positive definite, the maximum of the eigenvalues in Theorem 2 tends to 1 from below as \(\lambda \rightarrow \infty \). We see that the precisions will converge for \(\lambda \) large enough and will eventually generate a linear-update matrix with a spectral radius less than 1, that is sGaBP will converge for large enough \(\lambda \). In “Appendix 1”, we show that the posterior means provided by sGaBP (under the assumption of convergence) provide the exact marginal means.

We now make some comments on the behaviour of the spectrum (eigenvalues) of \({\mathbf {L}}\). One interesting aspect of the spectrum arises when sGaBP is applied to tree-structured precision matrices. We generated a few of these and in each case found the matrix \({\mathbf {L}}\) to be nilpotent when \(\lambda = 0\). This relates to BP being an efficient and exact marginalization algorithm on tree structures. There is therefore no reason to use values of \(\lambda \) other than zero in this case. A typical plot of the spectral radius as a function of \(\lambda \) is given in Fig. 1. In this case, the spectral radius has a global minimum at a value of \(\lambda \) just under 0.4. The spectral radius can correspond to either a complex or a real eigenvalue, and the graph of the spectral radius seems to change curvature when the eigenvalue responsible for the spectral radius changes from real to complex (and vice versa). This can be seen in Fig. 1, with the red solid line corresponding to a complex spectral radius and the black broken line to a real spectral radius. We also see that the spectral radius eventually becomes real, which is consistent with Theorem 2. Another important observation is that the value of \(\lambda \) which minimizes the spectral radius seems to occur at a point where the eigenvalue responsible switches from real to complex or vice versa. Furthermore, there can be more than one point where this change occurs. Our simulations show similar results for other precision matrices. The interaction between complex and real eigenvalues could prove useful in the minimization of the spectral radius and should be considered in further research.

6.4 The converged posteriors

Having proved convergence of sGaBP, we now turn to the posterior distributions as approximations of the marginal distributions. In Theorem 3, we prove that the posterior means are the exact marginal means; the proof is provided in “Appendix 1”. A consequence of Theorem 3 is that sGaBP can be used to solve linear systems, \({\mathbf {S}}{\varvec{\mu }} = {\varvec{b}}\), as long as \({\mathbf {S}}\) is a valid precision matrix. Unfortunately, the posterior precisions are not necessarily equal to the true marginal precisions. In the experimental section, we show that the posterior precisions provided by sGaBP can be useful as approximate quantities in the sense that the KL distance between the posterior and marginal distributions can be small. Included in our empirical study are two variants of GaBP, namely relaxed Gaussian belief propagation (RGaBP) and convergence fix Gaussian belief propagation (CF). All three methods (sGaBP, RGaBP and CF) require the specification of at least one hyper-parameter. We compare the precisions provided by these three methods by finding the values of the hyper-parameters yielding the fastest convergence. For RGaBP, the value of the hyper-parameter is irrelevant in terms of comparing precisions, since the posterior precisions provided are identical to those provided by ordinary GaBP. Within this set-up, we show empirically that sGaBP can provide substantially more accurate approximations of the marginal precisions (compared to RGaBP and CF) while also converging at faster rates.

7 Heuristic measures

In this section, we propose some heuristic measures for the selection of \(\lambda \). These measures vary in degree of complexity and we discuss some of their advantages and disadvantages.

7.1 Search heuristic

Algorithm 2: Search heuristic (SH)

This heuristic, the search heuristic (SH), is essentially the same as the one proposed by El-Kurdi et al. (2012a), adjusted for sGaBP, and is given in Algorithm 2. The main advantage of this heuristic is that it is easy to implement. There are some drawbacks to this measure arising from the monotone way in which the tuning is adjusted. When the current tuning provides posterior means with a smaller (larger) error, the heuristic will always decrement (increment) the tuning. The heuristic seeks a tuning for which the spectral radius of \({\mathbf {L}}\) is less than one, and not necessarily a tuning for which the value of the spectral radius is a minimum.

7.2 Gradient descent heuristic (GDH)

GDH is more complex to implement, but does not have the monotonicity of SH as described in Sect. 7.1. The heuristic is aimed at determining the direction in which the tuning needs to be adjusted to achieve the smallest possible spectral radius. The tuning is then adjusted in this direction in step sizes which should not be overly large.

Suppose we have completed iteration n of sGaBP and we are preparing to perform the next round of updates. We wish to adjust the value of the tuning parameter in a direction which yields faster convergence. At this point, we have the posterior precisions \(q_{i}^{(n)}: i = 1,2,\ldots ,p\) and posterior means \(\mu _{i}^{(n)}\). The posterior means were computed using

$$\begin{aligned} \mu _{i}^{(n)}&= \frac{q_{i}^{(n)}}{\lambda + q_{i}^{(n)}}\frac{b_{i} + \sum _{j \in {\mathcal {N}}_{i}}V^{(n)}_{ji}}{q_{i}^{(n)}} + \frac{\lambda }{\lambda + q_{i}^{(n)}} \mu ^{(n-1)}_{i} \nonumber \\&= [1-\gamma _{i}^{(n)}(\lambda )]\tilde{\mu }^{(n)}_{i} + \gamma _{i}^{(n)}(\lambda ) \mu ^{(n-1)}_{i}. \end{aligned}$$
(23)

Although this is not technically correct, we assume that \( q_{i}^{(n)},\mu ^{(n-1)}_{i}\) and \(\tilde{\mu }^{(n)}_{i}\) are constant and do not depend on \(\lambda \). The GDH starts by instructing each node to send its posterior mean to its neighbours; each node then computes \(e_{j} = \sum _{i \in \tilde{{\mathcal {N}}}_{j}}S_{ji}\mu _{i}^{(n)} - b_{j}\), where \(\tilde{{\mathcal {N}}}_{j}\) is \({\mathcal {N}}_{j}\) with node j included. Let \(k = \text {argmax}_{j} \{|e_{j}|\}\). Node k and each of its neighbours are instructed to compute the derivative of their own means (this can be done in parallel) by differentiating (23) with respect to \(\lambda \) and evaluating this at the current value of the tuning, say \(\lambda _{0}\). The neighbours of node k send these derivatives to node k and this node computes \(d_{k} = \sum _{j \in \{k,{\mathcal {N}}_{k}\}}S_{kj}\nabla \mu _{j}^{(n)}\), where \(\nabla \mu _{j}^{(n)}\) is the derivative received from node j. Node k is then instructed to adjust the tuning \(\lambda _{0} \leftarrow \lambda _{0} - \alpha \, \text {sign}(d_{k})\), for a specified step size \(\alpha \), and to send this new tuning value to the other nodes.
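A sketch of a single GDH adjustment, under the assumption above that \(q_{i}^{(n)}\), \(\mu _{i}^{(n-1)}\) and \(\tilde{\mu }_{i}^{(n)}\) are treated as constants in \(\lambda \), is given below; the function name and argument layout are illustrative.

```python
import numpy as np

def gdh_step(S, b, q, mu, mu_prev, mu_tilde, lam, alpha):
    """One gradient-descent-heuristic adjustment of the tuning parameter.

    q, mu, mu_prev and mu_tilde are the current posterior precisions,
    posterior means, previous-stage means and undamped means; they are
    held fixed in lambda, as in eq. (23).
    """
    e = S @ mu - b                                   # per-node residuals e_j
    k = int(np.argmax(np.abs(e)))                    # node with the largest residual
    nbrs_k = np.flatnonzero(S[k] != 0)               # includes k itself (S_kk != 0)
    # d mu_j / d lambda from (23) with q_j, mu_prev_j, mu_tilde_j held constant
    dmu = q * (mu_prev - mu_tilde) / (lam + q) ** 2
    d_k = S[k, nbrs_k] @ dmu[nbrs_k]
    return lam - alpha * np.sign(d_k)                # adjusted tuning value
```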

Fig. 2

Comparison of SH and GDH for a simulated data structure. The precision matrix was regulated to have a zero-diagonal spectral radius equal to one. Some relevant quantities are given in the display. We see that the tuning provided by SH is stuck around \(\lambda = 0\). Selecting \(\lambda = 0\) a priori gives a spectral radius equal to one. GDH provides tuning closer to the values yielding the fastest convergence (determined using a line search in increments of 0.01). GDH converges faster than SH

7.3 Comparing SH and GDH: a concrete example

We use simulation to illustrate the possible benefits of using GDH instead of SH. We start by simulating a \(100\times 100\) precision matrix, \({\mathbf {S}}\), and potential vector, \({\varvec{b}}\). We use the method in “Appendix 2” to regulate the zero-diagonal spectral radius of the precision matrix to 1. We defined convergence to occur when the error is less than \(10^{-14}\). Using a line search in increments of 0.01, we observed that initializing the tuning of sGaBP with the values 0.33, 0.34 and 0.35 yielded the fastest convergence and that this occurred after 28 iterations. We found that the spectral radius of \({\mathbf {L}}\) is 1 when \(\lambda = 0\) (this is typical when the zero-diagonal spectral radius of \({\mathbf {S}}\) is 1). The values of the spectral radius (of \({\mathbf {L}}\)) corresponding to \(\lambda = -0.01\) and \(\lambda = 0.01\) are 1.029343 and 0.971501, respectively. Assuming convergence of the precision components of the messages, we observed the error to be increasing for negative tuning and decreasing for positive tuning. If SH is used, there is the risk that the heuristic tuning will vary around the tuning corresponding to a spectral radius of one. This is because SH seeks a tuning for which the spectral radius of \({\mathbf {L}}\) is less than one and not necessarily a tuning which minimizes the spectral radius of \({\mathbf {L}}\). This is illustrated in Fig. 2. The y-axis shows the level of tuning used at the iteration number given on the x-axis. Figure 2 contains two lines for each of GDH and SH, corresponding to the step sizes 0.01 and 0.05. The tuning suggested by SH varies around \(\lambda = 0\), the level of tuning corresponding to a spectral radius of 1. GDH is not restricted around a spectral radius of one and is able to make better adjustments to the tuning. Notice that the two graphs corresponding to GDH are terminated at iterations 45 and 32, corresponding to step sizes 0.01 and 0.05, respectively. This was done to indicate that sGaBP had converged after these numbers of iterations. Both applications of SH failed to converge after 100 iterations. This is not to say that SH cannot be effective; indeed, its simplicity of implementation is an advantage over GDH. Rather, SH is more sensitive to the initialization of \(\lambda \), particularly when this starting value is close to the level of tuning yielding a spectral radius of one.

8 Asynchronous message updates

We have referred to the use of asynchronous message updates as opposed to the synchronous version. In general, it is believed that asynchronous message updates can provide better convergence behaviour in applications of BP, in the sense that they may induce convergence where synchronous updates diverge or may require the passing of fewer messages to converge (Koller and Friedman 2009). The major shortcoming of asynchronous updates is the loss of distributive applicability. Another problem posed by asynchronous updates is that of deciding upon the order in which messages are passed, since this can have a significant effect on the convergence speed. In the context of GaBP, this problem is compounded by the fact that synchronous message passing operates in iterations of \({\mathcal {O}}(p^{2})\) computations, which rules out the complicated scheduling heuristics used in other applications of BP. Progress can be made by deciding on the message scheduling in advance. There are other considerations as well, such as deciding on the degree of regularization. This should be considered from the viewpoint that the degree of regularization yielding optimal convergence should naturally provide useful posterior precisions. An example of the advantages of asynchronous message passing can be found in the diabetes data (Efron et al. 2004). The diabetes data were used to illustrate the advantages of the least angle regression algorithm in settings involving a high degree of collinearity among the explanatory variables. Estimating the linear coefficients of the diabetes data is challenging for GaBP since:

  1.

    The number of explanatory variables is small.

  2.

    The zero-diagonal spectral radius of the sample correlation matrix is high (3.024214).

  3.

    There is significant variation among the sample correlations.

Algorithm 3: Asynchronous sGaBP

A line search in increments of 0.01 revealed that \(\lambda = 1.29\) yields the fastest convergence for synchronous sGaBP (using a tolerance of \(10^{-10}\)), with convergence occurring after 574 iterations. The 574 iterations required for convergence represent a substantial effort when compared to the number of explanatory variables (which is 10). A further complication is that synchronous sGaBP with \(\lambda = 1.29\) yields negative posterior precisions for certain variables, although this can be addressed by increasing \(\lambda \) at the cost of slower convergence. We now apply asynchronous sGaBP, which is formulated in Algorithm 3. Notice that the outer loop of the message updates iterates over j, indicating that the inner loop iterates over messages to node j. This was done because we found that iterating over incoming messages first was more efficient in our simulations. Each round of message updates requires \({\mathcal {O}}(p^{2})\) computations (fewer with sparsity), as in the synchronous case. Unlike in the synchronous case, it is not necessary to store old messages, and Algorithm 3 performs damping throughout the double loop (instead of after it). The optimal tuning value was determined as 2.01 (using a line search as in the synchronous case) and convergence occurred after 131 iterations. All posterior precisions were positive. We see that asynchronous message passing improved on the convergence speed and the accuracy of the posterior distributions in the case of the diabetes data. In general, our simulations showed that asynchronous sGaBP outperforms synchronous sGaBP in terms of convergence behaviour. This is reinforced by the fact that asynchronous message passing does not require old messages to be stored, resulting in a lower computational burden in each iteration and lower memory requirements. We leave further aspects of asynchronous sGaBP, such as a proof of convergence and heuristic measures, for further research.
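The following sketch shows one possible in-place (asynchronous) sweep consistent with the description above; the exact scheduling and placement of the damping in Algorithm 3 may differ, so this should be read as an illustration rather than a reproduction.

```python
import numpy as np

def sgabp_async_sweep(S, b, lam, Q, V, mu):
    """One asynchronous sweep (our reading of the in-place variant).

    Messages are overwritten as soon as they are recomputed and the damped
    posterior mean of the sending node is refreshed inside the double loop.
    Assumes a unit-diagonal precision matrix S and scalar nodes.
    """
    p = len(b)
    nbrs = [np.flatnonzero((S[i] != 0) & (np.arange(p) != i)) for i in range(p)]
    for j in range(p):                   # outer loop: receiving node j
        for i in nbrs[j]:                # inner loop: messages i -> j
            q_in = Q[:, i].sum() - Q[j, i]            # exclude the message j -> i
            v_in = V[:, i].sum() - V[j, i]
            Q[i, j] = -S[i, j] ** 2 / (lam + 1.0 + q_in)
            V[i, j] = (Q[i, j] / S[i, j]) * (lam * mu[i] + b[i] + v_in)
            # refresh the damped posterior mean of node i, eq. (14)
            q_post = 1.0 + Q[:, i].sum()
            z = b[i] + V[:, i].sum()
            mu[i] = (lam * mu[i] + z) / (lam + q_post)
    return Q, V, mu
```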

9 Empirical work

In this section, we provide empirical comparisons of sGaBP with other GaBP variants in the literature as well as with the CG solver. Our empirical work is summarized using two quantities, namely the number of iterations required by a specified method to converge and, if relevant, the KL distance of the posterior distributions to the true marginal distributions. All quantities are summarized using boxplots, with the blue boxplots representing sGaBP and the red boxplots the method it is being compared with. Each figure corresponds to a set of zero-diagonal spectral radii which is indicated on the x-axis. For every zero-diagonal spectral radius indicated on the x-axis, we generate 100 data structures, each consisting of a precision matrix and a potential vector. We use the method described in “Appendix 2” to regulate the zero-diagonal spectral radius of the precision matrix to the appropriate value. We then apply sGaBP and the method it is being compared with on these data structures. With the exception of the CG solver, all other methods require the specification of hyper-parameter(s). We initialize these methods by finding the value(s) of the hyper-parameter(s) yielding the fastest convergence through a line (grid) search in increments of 0.01. We refer to sGaBP (for instance), initialized with the optimal hyper-parameter determined through the line search, as optimal sGaBP. Similar labels are used for the other methods.

We now have 100 data structures for every zero-diagonal spectral radius given on the x-axis of the figures. We apply optimal sGaBP and the (optimized) competitor and record the number of iterations required by each method to converge. The blue boxplot is constructed from the number of iterations required by sGaBP to converge and the red boxplot from the number of iterations required by the competitor. The KL distances are slightly more complicated since for each precision matrix we get multiple marginals. For each application of sGaBP (and its competitor), we determine the KL distance of all the posterior distributions to their respective marginals, a given data structure is represented by the mean of all these distances. Boxplots are then constructed in a similar way to those of the iterations. To account for differences in the scaling of quantities provided by different methods, it may be necessary to focus (or zoom in) on certain parts of a figure.
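The KL distances are computed between univariate Gaussians (posterior versus true marginal); a small sketch of the per-data-structure summary is given below, where the direction of the divergence, posterior relative to marginal, is our assumption.

```python
import numpy as np

def kl_gauss(mu_post, prec_post, mu_true, prec_true):
    """KL divergence KL( N(mu_post, 1/prec_post) || N(mu_true, 1/prec_true) )
    for univariate Gaussians, evaluated element-wise over all nodes."""
    var_post, var_true = 1.0 / prec_post, 1.0 / prec_true
    return 0.5 * (var_post / var_true
                  + (mu_true - mu_post) ** 2 / var_true
                  - 1.0
                  + np.log(var_true / var_post))

def mean_kl(mu_post, prec_post, mu_true, prec_true):
    # summary used per data structure: average KL distance over all p nodes
    return float(np.mean(kl_gauss(mu_post, prec_post, mu_true, prec_true)))
```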

9.1 Relaxed Gaussian belief propagation

El-Kurdi et al. (2012a) illustrate the advantages of RGaBP on large, ill-conditioned and weakly diagonally dominant inverse covariance matrices. RGaBP does not allow tuning of the precision components and can therefore only be applied in settings where the precision components of ordinary GaBP converge. The relaxation is applied to the mean components by setting \(z_{i}^{(n)} = \gamma (b_{i} + \sum _{j \in {\mathcal {N}}_{i}}V^{(n)}_{ji}) + (1-\gamma ) q_{i}^{(n)}\mu ^{(n-1)}_{i}\). Setting \(\gamma = 1\) gives ordinary GaBP (similar to setting \(\lambda = 0\) for sGaBP). Although El-Kurdi et al. (2012a) focus on relaxation factors (\(\gamma > 1\)), RGaBP can also be used to perform damping (\(\gamma < 1\)). There is an interesting relationship between RGaBP and sGaBP with regard to how posterior means are computed:

$$\begin{aligned} \text {RGaBP}: \mu _{i}^{(n)}&= \gamma \frac{(b_{i} + \sum _{j \in {\mathcal {N}}_{i}}V^{(n)}_{ji})}{q_{i}^{(n)}} + (1-\gamma ) \mu ^{(n-1)}_{i} \end{aligned}$$
(24)
$$\begin{aligned} \text {sGaBP}: \mu _{i}^{(n)}&= \frac{q_{i}^{(n)}}{\lambda + q_{i}^{(n)}}\frac{b_{i} + \sum _{j \in {\mathcal {N}}_{i}}V^{(n)}_{ji}}{q_{i}^{(n)}} \nonumber \\&\qquad + \frac{\lambda }{\lambda + q_{i}^{(n)}} \mu ^{(n-1)}_{i}. \end{aligned}$$
(25)

We see that \(\gamma = 1- \frac{\lambda }{\lambda + q_{i}^{(n)}}\). In contrast to RGaBP, sGaBP computes adaptive damping/relaxation factors using the tuning parameter \(\lambda \) and the posterior precisions. In particular, we see that relaxation, \(\gamma > 1\), and damping, \(\gamma < 1\), correspond to negative and positive \(\lambda \), respectively. This would imply that there is a role to play for negative \(\lambda \). Part of our comparison is to give an indication of when to use relaxation versus damping. It is also worthwhile to emphasize that the posterior precisions provided by RGaBP are the posterior precisions provided by ordinary GaBP. Another important contribution in this regard is to provide empirical evidence that sGaBP can provide posterior precisions closer to the true marginal precisions when compared to ordinary GaBP.

Fig. 3

Comparison of the convergence speed of optimal sGaBP and optimal RGaBP over different zero-diagonal spectral radii. sGaBP outperformed RGaBP in these simulations, the relative convergence speed of RGaBP tending to decrease as the zero-diagonal spectral radius increases. (Color figure online)

Fig. 4

This is similar to Fig. 3, but the boxplots now represent the mean KL distance between the posterior marginals (provided by each method) and the true marginals. In these simulations, sGaBP provided more accurate approximations to the true marginals

In Fig. 3, the convergence speed of optimal sGaBP and optimal RGaBP are compared. For smaller zero-diagonal spectral radii, the methods are very comparable with sGaBP holding a slight advantage. As the zero-diagonal spectral radius approaches 1.5, the convergence speed of RGaBP starts to destabilize. When considering the boxplot corresponding to a zero-diagonal spectral radius of 1.5 we see that sGaBP can converge up to 16 times faster than RGaBP. It is also worthwhile to note that outliers were suppressed in these boxplots.

Fig. 5

This is similar to Figs. 3 and 4; however, comparisons were made over different zero-diagonal spectral radii. In contrast to the previous figures, the smallest eigenvalue was used to regulate the zero-diagonal spectral radius. In these simulations, the optimal relaxation factor is greater than one, which corresponds to negative \(\lambda \) in the case of sGaBP. In these simulations, sGaBP converged faster and provided more accurate posterior marginals

In Fig. 4, the KL distances of optimal sGaBP and optimal RGaBP are compared. In the simulations, sGaBP provided far more accurate posterior distributions. The simulations provide evidence, even in cases where the optimal convergence speeds are comparable, that it is better to use sGaBP instead of RGaBP, since sGaBP provides posterior precisions closer to the true marginal precisions.

An interesting aspect is the role of relaxation versus damping in the acceleration of GaBP. Relaxation corresponds to \(\gamma > 1\), or negative \(\lambda \), while damping occurs when \(\gamma < 1\), or positive \(\lambda \). The zero-diagonal spectral radius of a precision matrix can be determined by one of two quantities, these being either the largest or the smallest eigenvalue of the precision matrix. In our simulations, we found that optimal convergence occurs with relaxation factors when the zero-diagonal spectral radius is determined by the smallest eigenvalue, and with damping otherwise. This indicates that relaxation can only be applied when the zero-diagonal spectral radius is less than one, because if the zero-diagonal spectral radius is at least one and caused by the smallest eigenvalue, the (standardized) precision matrix will be either singular or negative definite. Figure 5 is constructed by considering zero-diagonal spectral radii less than one and determined by the smallest eigenvalue of the precision matrix. Each application of optimal sGaBP and optimal RGaBP involved the use of relaxation factors. In terms of performance, we can make similar observations to those made on Figs. 3 and 4. In these simulations, optimal sGaBP outperforms optimal RGaBP, both in terms of convergence speed and KL distances, with the relative performance improving as the zero-diagonal spectral radius approaches one. One can argue that the comparisons made in Fig. 5 are more relevant than the others made in this section since, as the name suggests, the focus of RGaBP is on relaxation factors.

Another method proposed in the literature to improve on the convergence behaviour of GaBP is based on the principle of message damping (Malioutov et al. 2006). As is mentioned by Malioutov et al. (2006), we found in our simulations that the convergence/divergence of the precision components is independent of the degree of damping applied. Furthermore, when the precisions do converge, we found that the degree of damping does not influence the actual converged posterior precisions. We also observed that RGaBP tends to outperform the message damping approach, based on optimal comparisons, and therefore did not include this in our empirical comparisons.

9.2 Compressed inner-loop convergence fix

Algorithm 4: Compressed inner-loop convergence fix

The convergence fix (CF) method has been proposed in the literature as a method of solving arbitrary symmetric positive definite linear systems with GaBP (Johnson et al. 2009). The basic idea is to solve systems of the form \(({\mathbf {S}}+ \mathbf {\Gamma }){\varvec{\mu }}^{(n+1)} = {\varvec{b}} + \mathbf {\Gamma }{\varvec{\mu }}^{(n)}\) using ordinary GaBP. Johnson et al. (2009) show that if \({\mathbf {S}}+ \mathbf {\Gamma }\) is walk-summable, CF will converge and provide the correct solution to the system \({\mathbf {S}}{\varvec{\mu }} = {\varvec{b}}\). We restrict our focus to the case where \( \mathbf {\Gamma } = \lambda {\mathbf {I}}\). Johnson et al. (2009) make a brief reference to a compressed inner-loop version of CF where each application of GaBP is limited to one iteration. They report that compressed inner-loop CF can be more efficient than the original method, but may require damping of the adjustment of the potential vector. The closeness of the compressed CF variant to sGaBP depends heavily on the interpretation of the description in the literature. Johnson et al. (2009) do not prove convergence of compressed CF and do not consider the potential usefulness of the diagonal loading of the precision matrix in the tuning of the posterior precisions. We could not find any reference to compressed CF in the source code provided by Bickson (2008), and we formed our interpretation of it by considering the source code provided for the original CF method along with the description by Johnson et al. (2009). We give this interpretation in Algorithm 4. We now compare our interpretation of compressed CF to sGaBP.
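To illustrate the fixed point underlying CF with \(\mathbf {\Gamma } = \lambda {\mathbf {I}}\), the sketch below runs the outer iteration with a direct inner solve as a stand-in for GaBP; in CF proper the inner solve is ordinary GaBP on the loaded matrix, and in the compressed inner-loop variant only a single GaBP iteration is performed per outer step.

```python
import numpy as np

def convergence_fix_outer(S, b, lam, n_outer=100, tol=1e-10):
    """Outer iteration of the convergence-fix idea with Gamma = lam * I:
    (S + lam I) mu_new = b + lam * mu_old.

    The direct solve below is purely illustrative of the fixed point,
    whose solution satisfies S mu = b.
    """
    p = len(b)
    mu = np.zeros(p)
    A = S + lam * np.eye(p)
    for _ in range(n_outer):
        mu_new = np.linalg.solve(A, b + lam * mu)
        if np.max(np.abs(mu_new - mu)) < tol:
            return mu_new
        mu = mu_new
    return mu
```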

Fig. 6

Illustration of the iterations required for convergence of optimal sGaBP (blue) and optimal CF (red). Both methods are relatively stable; however, sGaBP converged faster in the simulations. The relative performance of sGaBP seems to improve with growth in the zero-diagonal spectral radius. (Color figure online)

The visual summaries of the iterations required for convergence and the KL distances are given in Figs. 6 and 7, respectively. In the simulations, sGaBP outperformed CF in terms of convergence speed. Both methods were relatively stable in terms of the number of iterations required for convergence. The performance of CF in terms of KL distances to the true marginals was poor relative to the performance of sGaBP. In our simulations, we found that the degree of diagonal loadings required by CF to converge optimally was substantially higher than the tuning parameter required by optimal sGaBP. These simulations provide empirical evidence that sGaBP should be used instead of our interpretation of CF, both in terms of convergence speed and accuracy of the posterior distributions.

Fig. 7

Illustration of the accuracy of the posterior distributions of optimal sGaBP (blue) and optimal CF (red) to the true marginals. In the simulations, sGaBP provided more accurate approximations. The poor performance of CF is due to the high values of the diagonal loadings it requires to converge optimally. (Color figure online)

9.3 Conjugate gradient

Fig. 8 Comparison of the conjugate gradient solver with sGaBP. The CG method does not provide approximations to the marginal precisions, but we include the mean KL distances for sGaBP. In our simulations, the two methods were comparable in terms of iterations required for convergence, and the boxplots for each method were reasonably stable, with CG having a slight advantage at larger zero-diagonal spectral radii. Note that for each zero-diagonal spectral radius the boxplots corresponding to sGaBP and CG are plotted adjacent to each other.

One of the attractive properties of GaBP as a solver of large and sparse systems of linear equations is its suitability for distributive computing. In general, BP algorithms are well suited to distributive implementation under synchronous message scheduling, since no communication is required between nodes that are not linked in the graph. Like GaBP, the CG method is a solver of linear systems and can be applied in distributive settings. A description of the CG solver can be found in Shewchuk (1994). Unlike GaBP, CG is guaranteed to converge for all symmetric positive definite linear systems, and to do so in at most p iterations, where p is the number of variables in the system. This causes sGaBP to compare unfavourably with CG on small linear systems, and hence our focus is on systems with a large number of variables. In practice, the CG method typically converges in far fewer than p iterations, and convergence is faster for linear systems with a smaller condition number. One of the contributions of this paper is a message-passing scheme which guarantees convergence, and therefore we include a comparison with the CG method. We note that sGaBP and CG come from different areas of mathematics, namely approximate inference and linear algebra, respectively. The main advantage of the CG method is that it requires no regularization; the main advantage of sGaBP is that it also provides approximate precisions.
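For reference, a minimal textbook implementation of the CG iteration (following the standard form described by Shewchuk 1994, not the code used in our simulations) is:

```python
import numpy as np

def conjugate_gradient(S, b, tol=1e-8, max_iter=None):
    """Plain CG for a symmetric positive definite system S @ mu = b."""
    p = len(b)
    if max_iter is None:
        max_iter = p                    # at most p iterations in exact arithmetic
    mu = np.zeros(p)
    r = b - S @ mu                      # residual
    d = r.copy()                        # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Sd = S @ d
        alpha = rs_old / (d @ Sd)       # optimal step length along d
        mu = mu + alpha * d
        r = r - alpha * Sd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs_old) * d   # next S-conjugate direction
        rs_old = rs_new
    return mu
```

In exact arithmetic the loop terminates after at most p iterations; in practice the residual-based stopping rule usually triggers much earlier for well-conditioned systems.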

We now compare CG with optimal sGaBP in linear systems with 700 variables. The results are given in Fig. 8. In these simulations, we see that both methods are quite stable and very comparable, although CG has a small advantage in the simulations involving larger zero-diagonal spectral radii. The bottom plot of Fig. 8 shows the mean KL distances obtained for sGaBP. We see that these distances are small and therefore the posterior precisions can be useful as approximations of the true marginal precisions.

There are strategies which can be used to accelerate sGaBP. One approach would be to consider asynchronous message passing; the main drawback of this strategy is the loss of distributive applicability. Another approach is to use multiple tuning parameters, that is, one tuning parameter per node. This can not only improve convergence speed, but could also be used to obtain (even) more accurate approximations to the marginal precisions; the disadvantage is that deciding on the amount of tuning becomes more complex. Another interesting strategy is to increase the dimension of the nodes, that is, to assign more than one variable to each node. The difficulties here are deciding which variables to cluster together and the higher computational cost of communication between higher-dimensional nodes. In certain situations, we found that the GDH can improve on optimal sGaBP in terms of convergence speed. It is also possible to extend the GDH to allow for multiple tuning parameters, which may accelerate convergence further. The main problem surrounding the GDH is the specification of the step size.

There are strategies to accelerate CG as well, the most prominent being preconditioning. Consider solving the system \({\mathbf {S}}{\varvec{\mu }} = {\varvec{b}}\). The idea behind preconditioning is to select a matrix \({\mathbf {P}}\), solve \({\mathbf {P}}{\mathbf {S}}{\mathbf {P}}'\tilde{{\varvec{\mu }}} = {\mathbf {P}}{\varvec{b}}\) and transform back to the original system using \({\varvec{\mu }} = {\mathbf {P}}'\tilde{{\varvec{\mu }}}\). The matrix \({\mathbf {P}}\) should be selected such that the condition number of \({\mathbf {P}}{\mathbf {S}}{\mathbf {P}}'\) is smaller than that of \({\mathbf {S}}\), while the computational cost of forming \({\mathbf {P}}{\mathbf {S}}{\mathbf {P}}'\) remains low. We wish to emphasize that sGaBP can also benefit substantially from this type of preconditioning, even more so because it makes the selection of tuning parameters easier. The major loss is in the accuracy of the posterior precisions as approximations of the marginal precisions: the posterior precisions now approximate the marginal precisions of \({\mathbf {P}}{\mathbf {S}}{\mathbf {P}}'\), and transformation back to \({\mathbf {S}}\) cannot (directly) be done without knowing the off-diagonal entries of \([{\mathbf {P}}{\mathbf {S}}{\mathbf {P}}']^{-1}\). Finding a method to sensibly transform the approximate precisions back to the original scale would be very rewarding.
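As a simple illustration, the sketch below applies this transformation with the common, and here merely illustrative, choice \({\mathbf {P}} = \mathrm {diag}({\mathbf {S}})^{-1/2}\), for which \({\mathbf {P}}' = {\mathbf {P}}\); any solver for symmetric positive definite systems (for instance the conjugate_gradient sketch above, or an sGaBP routine) can be supplied as the solver argument.

```python
import numpy as np

def jacobi_preconditioned_solve(S, b, solver):
    """Split preconditioning with the illustrative choice P = diag(S)^{-1/2}.

    Solves P S P' mu_tilde = P b with the supplied SPD solver and maps
    back to the original system via mu = P' mu_tilde.
    """
    P = np.diag(1.0 / np.sqrt(np.diag(S)))  # diagonal preconditioner, P' = P
    S_pre = P @ S @ P                       # preconditioned precision matrix
    b_pre = P @ b                           # preconditioned potential vector
    mu_tilde = solver(S_pre, b_pre)
    return P @ mu_tilde                     # back-transform to the original system
```

Note that, as discussed above, any posterior precisions produced by the solver then refer to \({\mathbf {P}}{\mathbf {S}}{\mathbf {P}}'\) rather than to \({\mathbf {S}}\).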

10 Concluding remarks and further research

We proposed an adjusted BP method on general MGs to address some of the problems underlying general BP. We took this high-level approach and applied it to a Gaussian MG; this type of BP is referred to as Gaussian belief propagation (GaBP). We showed that sGaBP (our variant of GaBP) always converges given sufficient regularization, and showed how to compute posterior distributions so that the posterior means equal the exact marginal means. We provided empirical evidence that the posterior precisions provided by sGaBP are better approximations of the true marginal precisions than those of two other variants of GaBP, with hyper-parameters in all cases set to yield the fastest convergence. This seems to indicate that our high-level approach should be investigated in other MGs and perhaps also in other graph structures (such as cluster graphs). Within the GaBP context, several questions should be the subject of further research. The use of asynchronous message updates needs attention. The ranges of \(\lambda \) which guarantee convergence need to be specified, and work is needed on methods for finding the value of \(\lambda \) which yields the fastest convergence. Theoretical bounds on the proximity of the posterior precisions to the marginal precisions at the value of \(\lambda \) corresponding to the fastest convergence would also be useful. Further improvements in convergence and in the accuracy of the posterior distributions can be obtained through the use of multiple regularization parameters, and this should be investigated. Another natural extension of our work is a generalization to multivariate nodes. A further interesting prospect is to consider other loss functions in Eq. 5. For instance, setting \({\varvec{\mu }}_{i}^{(n-1)} = \mathbf {0}\) and using \(q=2\) relates to ridge regression under the linear model, while \(q = 1\) relates to the Lasso. The sGaBP implementation of the Lasso can be done without loss of the conjugacy of the messages by majorizing the \(L_{1}\)-norm with an \(L_{2}\)-norm. To ensure general convergence in the case of the Lasso, it may be necessary to use a penalty of the form \(\tau ||{\mathbf {x}}_{i}||_{1} + \frac{\lambda }{2}||{\mathbf {x}}_{i}-{\varvec{\mu }}_{i}^{(n-1)}||_{2}^{2}\) while applying a working-set method to the messages being updated. The latter is necessary since the absolute value function cannot be majorized by a quadratic at the origin.
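For instance, a standard quadratic majorizer of the absolute value at a current iterate \(x_{0} \ne 0\) is \(|x| \le \frac{x^{2}}{2|x_{0}|} + \frac{|x_{0}|}{2}\), with equality when \(|x| = |x_{0}|\); this bound degenerates as \(x_{0} \rightarrow 0\), which is precisely why the quadratic proximal term and a working-set strategy are required near the origin.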