1 Introduction

Binary variables occur in many models of interest. Variational autoencoders (VAE) with binary latent states are used to learn generative models with compressed representations [10, 11, 22, 33] and to learn binary hash codes for text and image retrieval [6, 30, 7, 20]. Neural networks with binary activations and weights are extremely computationally efficient and attractive for embedded applications, pushed forward in particular by vision research [12, 8, 25, 34, 36, 1, 15, 2, 31, 4, 17, 5]. Training these discrete models is possible via a stochastic relaxation, equivalent to training a Stochastic Binary Network (SBN) [23, 24, 29, 26, 27]. In this relaxation, each binary weight is replaced with a Bernoulli random variable and each binary activation is replaced with a conditional Bernoulli variable. The gradient of the expected loss in the weight probabilities is well defined, and SGD optimization can be applied.

For the problem of estimating the gradient of the expectation in the probabilities of (conditional) Bernoulli variables, several unbiased estimators have been proposed [19, 11, 9, 32, 35]. However, in the context of deep SBNs these methods become impractical: MuProp [11] and reinforce with baselines [19] have a prohibitively high variance in deep layers [28, Figs. C6, C7], while the complexity of other methods grows quadratically with the number of Bernoulli layers. In these cases, biased estimators were more successful in practice: straight-through (ST) [28], Gumbel-Softmax (GS) [13, 16] and their variants. In order to approximate the gradient of the expectation, these methods use a single sample of all random entities and the derivative of the objective function extended to the real-valued domain. A more accurate PSA method was presented in [29], which has low computational complexity but applies only to SBNs of classical structure (footnote 1) and requires specialized convolutions. Notably, it was experimentally reported [29, Fig. 4] that the baseline ST performs nearly identically to PSA in moderate size SBNs. Figure 1 schematically illustrates the bias-variance tradeoff of the different approaches.

Fig. 1.

Schematic illustration of bias-variance tradeoffs (we do not claim exactness, but see the experimental evaluations in [29, 28]; notice that the Mean Squared Error (MSE) is the sum of the variance and the squared bias). Unbiased methods have a prohibitively high variance for deep models. PSA achieves a significant reduction in variance at the price of a small bias, but has limited applicability. According to [29], the ST estimator can be as accurate as PSA in wide deep models. We analytically study the methods in the gray area: GS, DARN and FouST, in order to find out whether they can offer a sound improvement over ST. In particular, for the GS estimator the figure illustrates its possible tradeoffs when varying the temperature parameter, according to the asymptotes we prove.

Contribution. In this work we analyze theoretical properties of several recent single-sample gradient-based methods: GS, ST-GS [13], BayesBiNN [18] and FouST [22]. We focus on clarifying these techniques, studying their limitations and identifying incorrect and over-claimed results. We give a detailed analysis of the bias and variance of the GS and ST-GS estimators. Next, we analyze the application of GS in BayesBiNN. We show that a correct implementation would result in an extremely high variance. However, due to a hidden implementation issue, the estimator in effect reduces to a deterministic straight-through (with zero variance). A long-range effect of this swap is that BayesBiNN fails to solve the variational Bayesian learning problem as claimed. FouST [22] proposed several techniques for lowering the bias and variance of the baseline ST estimator. We show that the baseline ST estimator was applied incorrectly and that some of the proposed improvements may increase bias and/or variance.

We believe these results are valuable for researchers interested in applying these methods, working on improved gradient estimators or developing Bayesian learning methods. Incorrect results with hidden issues in the area could mislead many researchers and slow down the development of new methods.

Outline. The paper is organized as follows. In Sect. 2 we briefly review the baseline ST estimator. In the subsequent sections we analyze the Gumbel-Softmax estimator (Sect. 3), BayesBiNN (Sect. 4) and the FouST estimator (Sect. 5). Proofs are provided in the respective Appendices A to C. As most of our results are theoretical, simplifying derivations or identifying limitations and misspecifications of preceding work, we do not present extensive experiments. Instead, we refer to the literature for the experimental evidence that already exists and only conduct specific experimental tests as necessary. In Sect. 6 we summarize our findings and discuss how they can facilitate future research.

2 Background

We define a stochastic binary unit \(x\sim \text {Bernoulli}(p)\) as \(x=1\) with probability p and \(x=0\) with probability \(1-p\). Let f(x) be a loss function, which in general may depend on other parameters and may be stochastic aside from the dependence on x. This is particularly the case when f is a function of multiple binary stochastic variables and we study its dependence on one of them explicitly. The goal of binary gradient estimators is to estimate

$$\begin{aligned} g = \frac{\mathrm {d}}{\mathrm {d}p}{\mathbb {E}}[f(x)], \end{aligned}$$
(1)

where \({\mathbb {E}}\) is the total expectation. Gradient estimators which we consider make a stochastic estimate of the total expectation by taking a single joint sample. We will study their properties with respect to x only given the rest of the sample fixed. In particular, we will confine the notion of bias and variance to the conditional expectation \({\mathbb {E}}_x\) and the conditional variance \(\mathbb {V}_x\). We will assume that the function f(x) is defined on the interval [0, 1] and is differentiable on this interval. This is typically the case when f is defined as a composition of simple functions, such as in neural networks. While for discrete inputs x, the continuous definition of f is irrelevant, it will be utilized by approximations exploiting its derivatives.

The expectation \({\mathbb {E}}_x [f(x)]\) can be written as

$$\begin{aligned} (1-p) f(0) + p f(1). \end{aligned}$$
(2)

Its gradient in p is then

$$\begin{aligned} g = \frac{\mathrm {d}}{\mathrm {d}p} {\mathbb {E}}_x [f(x)] = f(1) - f(0). \end{aligned}$$
(3)

While this is simple for one random variable x, it requires evaluating f at two points. With n binary units in the network, in order to estimate all gradients stochastically, we would need to evaluate the loss 2n times, which is prohibitive.

Of high practical interest are stochastic estimators that evaluate f only at a single joint sample (i.e., perform a single forward pass). Arguably, the simplest such estimator is the straight-through (ST) estimator:

$$\begin{aligned} \hat{g}_{\text {st}}= f'(x). \end{aligned}$$
(4)

For an in-depth introduction and a more detailed study of its properties we refer to [28]. The mean and variance of this ST estimator are given by

$$\begin{aligned} {\mathbb {E}}_x[\hat{g}_{\text {st}}]&= (1-p)f'(0) + pf'(1),\end{aligned}$$
(5a)
$$\begin{aligned} \mathbb {V}_x[\hat{g}_{\text {st}}]&= {\mathbb {E}}_x [\hat{g}_{\text {st}}^2] - ({\mathbb {E}}_x [\hat{g}_{\text {st}}])^2 = p (1-p)(f'(1) - f'(0))^2. \end{aligned}$$
(5b)

If f(x) is linear in x, i.e., \(f(x) = h x + c\), where h and c may depend on other variables, then \(f'(0) = f'(1) = h\) and \(f(1)-f(0) = h\). In this linear case we obtain

$$\begin{aligned} {\mathbb {E}}_x[\hat{g}_{\text {st}}]&= h,\end{aligned}$$
(6a)
$$\begin{aligned} \mathbb {V}_x[\hat{g}_{\text {st}}]&= 0. \end{aligned}$$
(6b)

From the first expression we see that the estimator is unbiased, and from the second one we see that its variance (due to x) is zero. It is therefore a reasonable baseline: if f is close to linear, we may expect the estimator to behave well. Indeed, there is theoretical and experimental evidence [28] that in typical neural networks, the more units are used per layer, the closer we are to the linear regime (at least initially) and the better the utility of the estimate for optimization. Furthermore, [29] show that in SBNs of moderate size, the accuracy of the ST estimator is on par with the more accurate PSA estimator.
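As a sanity check, the moments (5a) and (5b) are easy to verify numerically. The following minimal NumPy sketch (the toy loss f and all names are our own illustrative choices, not from any cited work) compares the Monte Carlo mean and variance of the ST estimator against the closed forms and against the true gradient (3):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 1_000_000
f  = lambda x: x**3 + x          # toy smooth loss (our assumption)
fp = lambda x: 3 * x**2 + 1      # its derivative

x = (rng.random(n) < p).astype(float)   # x ~ Bernoulli(p)
g_st = fp(x)                            # ST estimator, Eq. (4)

print(g_st.mean(), (1 - p) * fp(0) + p * fp(1))       # matches Eq. (5a)
print(g_st.var(),  p * (1 - p) * (fp(1) - fp(0))**2)  # matches Eq. (5b)
print("true gradient:", f(1) - f(0))                  # Eq. (3); the gap to (5a) is the bias
```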

We will study alternative single-sample approaches and improvements proposed to the basic ST. In order to analyze BayesBiNN and FouST we will switch to the \(\pm 1\) encoding. We will write \(y\sim \text {Bin}(p)\) to denote a random variable with values \(\{-1,1\}\) parametrized by \(p = \mathbb {P}_y(y{=}1)\). Alternatively, we will parametrize the same distribution using the expectation \(\mu = 2p-1\) and denote this distribution as \(\text {Bin}(\mu )\) (the naming convention and the context should make it unambiguous). Note that the mean of \(\text {Bernoulli}(p)\) is p. The ST estimator of the gradient in the mean parameter \(\mu \) in both \(\{0,1\}\) and \(\{-1,1\}\) valued cases is conveniently given by the same equation (4).

Proof

Indeed, \({\mathbb {E}}_y [f(y)]\) with \(y\sim \text {Bin}(\mu )\) can be equivalently expressed as \({\mathbb {E}}_x [\tilde{f}(x)]\) with \(x\sim \text {Bernoulli}(p)\), where \(p = \frac{\mu + 1}{2}\) and \(\tilde{f}(x) = f(2x -1)\). The ST estimator of the gradient in the Bernoulli probability p for a sample x can then be written as

$$\begin{aligned} \hat{g}_{\text {st}}= \tilde{f}'(x) = 2 f'(y), \end{aligned}$$
(7)

where \(y= 2x-1\) is a sample from \(\text {Bin}(\mu )\). The gradient estimate in \(\mu \) becomes \(2 f'(y) \frac{\partial p}{\partial \mu } = f'(y)\).    \(\square \)

3 Gumbel Softmax and ST Gumbel-Softmax

Gumbel Softmax [13] and Concrete relaxation [16] enable differentiability through discrete variables by relaxing them to real-valued variables that follow a distribution closely approximating the original discrete distribution. The two works [13, 16] have contemporaneously introduced the same relaxation, but the name Gumbel Softmax (GS) became more popular in the literature.

A categorical discrete random variable x with K category probabilities \(\pi _k\) can be sampled as

$$\begin{aligned} x = \mathrm {arg\,max}_k(\log \pi _k - \varGamma _k ), \end{aligned}$$
(8)

where \(\varGamma _k\) are independent Gumbel noises. This is known as Gumbel reparametrization. In the binary case with categories \(k\in \{1,0\}\) we can express it as

$$\begin{aligned} x = [\![\log \pi _1 - \varGamma _{1} \ge \log \pi _0 - \varGamma _{0} ]\!], \end{aligned}$$
(9)

where \([\![\cdot ]\!]\) is the Iverson bracket. More compactly, denoting \(p=\pi _1\),

$$\begin{aligned} x = [\![\log \frac{p}{1-p} - (\varGamma _{1} - \varGamma _{0}) \ge 0 ]\!]. \end{aligned}$$
(10)

The difference of two Gumbel variables \(z = \varGamma _{1} - \varGamma _{0}\) follows the logistic distribution. Its cdf is \(\sigma (z) = \frac{1}{1+e^{-z}}\). Denoting \(\eta = \mathrm{logit}(p)\), we obtain the well-known noisy step function representation:

$$\begin{aligned} x = [\![\eta - z \ge 0 ]\!]. \end{aligned}$$
(11)

This reparametrization of binary variables is exact, but it does not yet allow differentiating a single sample: the derivative of the step function in (11) is zero almost everywhere, so we cannot take the derivative under the expectation in (1). The relaxation [13, 16] replaces the threshold function by a continuously differentiable approximation \(\sigma _\tau (\eta ) := \sigma (\eta /\tau ) = \frac{1}{1+e^{-\eta /\tau }}\). As the temperature parameter \(\tau >0\) decreases towards 0, the function \(\sigma _\tau (\eta )\) approaches the step function. The GS estimator of the derivative in \(\eta \) is then defined as the total derivative of f at a random relaxed sample:

$$\begin{aligned} z&\sim \text {Logistic},\end{aligned}$$
(12a)
$$\begin{aligned} \tilde{x}&= \sigma _\tau (\eta - z),\end{aligned}$$
(12b)
$$\begin{aligned} \frac{\hat{d} f}{d \eta }&:= \frac{\mathrm {d}f (\tilde{x})}{\mathrm {d}\eta } = f'(\tilde{x}) \frac{\partial \tilde{x}}{\partial \eta }. \end{aligned}$$
(12c)

A possible misconception about the GS gradient estimator is that it can be made arbitrarily accurate by using a sufficiently small temperature \(\tau \). This is however not so simple, and we will clarify the theoretical reasons why. An intuitive explanation is proposed in Fig. 2. Formally, we show the following properties.
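Before stating the propositions, the tradeoff is easy to observe empirically. Below is a minimal NumPy sketch of (12a)–(12c) for a single unit with a linear loss \(f(x) = hx + c\) (the values of \(\eta \), h and all names are our own assumptions); consistently with Propositions 1 and 2 below, the estimated bias shrinks as \(\tau \) decreases while the variance blows up:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
eta, h, n = 0.3, 1.0, 1_000_000   # logit and slope of the linear loss (assumptions)

# exact gradient d E[f(x)] / d eta for f(x) = h*x + c is h * sigma'(eta)
true = h * sigmoid(eta) * (1 - sigmoid(eta))

for tau in (1.0, 0.3, 0.1, 0.03):
    z = rng.logistic(size=n)              # (12a) logistic noise
    s = sigmoid((eta - z) / tau)          # (12b) relaxed sample x_tilde
    g = h * s * (1 - s) / tau             # (12c) with f'(x_tilde) = h
    print(f"tau={tau}: bias={g.mean() - true:+.5f}, var={g.var():.2f}")
```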

Fig. 2.

GS estimator: relaxed samples \(\tilde{x}\) are obtained and differentiated as follows. Noisy inputs, following a shifted logistic distribution (black density), are passed through a smoothed step function \(\sigma _\tau \) (blue). Observe that for a small \(\tau \), the derivative is, with high probability over \(\eta -z\), close to zero (green) and, very rarely, when \(|\eta -z|\) is small, it becomes \(O(1/\tau )\) large (red). (Color figure online)

Proposition 1

The GS estimator is asymptotically unbiased as \(\tau \rightarrow 0\), and the bias decreases at the rate \(O(\tau )\) in general and at the rate \(O(\tau ^2)\) for linear functions.

Proof in Appendix A. The decrease of the bias with \(\tau \rightarrow 0\) is a desirable property, but this advantage is practically nullified by the fast increase of the variance:

Proposition 2

The variance of the GS estimator grows at the rate \(O(\frac{1}{\tau })\) as \(\tau \rightarrow 0\).

Proof in Appendix A. This fast growth of the variance prohibits the use of small temperatures in practice. The behavior of the gradient estimator is described in more detail by the following two propositions.

Proposition 3

For any given realization \(z \ne \eta \), the norm of the GS estimator asymptotically vanishes at the exponential rate \(O(\frac{1}{\tau } c^{1/\tau })\) with \(c=e^{-|\eta - z|} < 1\).

Proof in Appendix A. For small \(|\eta - z|\), where c is close to one, the term \(1/\tau \) dominates at first. In particular, for \(z=\eta \) the asymptote is \(O(1/\tau )\). So while for most noise realizations the gradient magnitude vanishes exponentially quickly, this is compensated by a significant growth at the rate \(1/\tau \) around \(z=\eta \). In practice it means that most of the time a value of the gradient close to zero is measured and, very rarely, a value of \(O(1/\tau )\) is obtained.

Proposition 4

The probability of observing a GS gradient of norm at least \(\varepsilon \) is asymptotically \(O(\tau \log (\frac{1}{\varepsilon }))\), where the asymptote is \(\tau \rightarrow 0\), \(\varepsilon \rightarrow 0\).

Proof in Appendix A.

Unlike ST, the GS estimator with \(\tau >0\) is biased even for linear objectives, and even for a single neuron and a linear objective it has a non-zero variance. Propositions 3 and 4 apply also to the case of a layer with multiple units, since they just analyze the factor \(\frac{\partial }{\partial \eta }\sigma _\tau (\eta -z)\), which is present independently at all units. Proposition 4 can be extended to deep networks with L layers of Bernoulli variables, in which case the chain derivative encounters L such factors and the probability of observing a gradient with norm at least \(\varepsilon \) vanishes at the rate \(O(\tau ^L)\).

These facts should convince the reader of the following: it is not possible to use a very small \(\tau \), not even with an annealing schedule starting from \(\tau =1\). For a very small \(\tau \) the most likely consequence would be to never encounter a numerically non-zero gradient during the whole training. For moderately small \(\tau \) the variance would be prohibitively high. Indeed, Jang et al. [13] anneal \(\tau \) only down to 0.5 in their experiments.

A major issue with this and other relaxation techniques (i.e., techniques using relaxed samples \(\tilde{x}\in \mathbb {R}\)) is that the relaxation biases all the expectations. There is only one forward pass, and hence the relaxed samples \(\tilde{x}\) are used for all purposes, not only for estimating the gradient with respect to the given neuron. This biases all expectations for all other units in the same layer as well as in preceding and subsequent layers (in an SBN). Let, for example, f depend on additional parameters \(\theta \) in a differentiable way. More concretely, \(\theta \) could be the parameters of the decoder in a VAE. With a Bernoulli sample x, an unbiased estimate of the gradient in \(\theta \) can be obtained simply as \(\frac{\partial }{\partial \theta }f(x;\theta )\). However, if we replace the sample with a relaxed sample \(\tilde{x}\), the estimate \(\frac{\partial }{\partial \theta }f(\tilde{x};\theta )\) becomes biased because the distribution of \(\tilde{x}\) only approximates the distribution of x. If y were other binary variables relaxed in a similar way, the gradient estimate for x would become more biased, because \(E_{\tilde{y}}[\nabla _{\tilde{x}}f(\tilde{x}, \tilde{y})]\) is a biased estimate of the desired \(E_{y}[\nabla _{\tilde{x}}f(\tilde{x}, y)]\). Similarly, in a deep SBN, the relaxation applied in one layer of the model additionally biases all expectations for units in layers below and above. In practice the accuracy for VAEs is relatively good [13], [28, Fig. 3], while for deep SBNs a bias higher than that of ST is observed for \(\tau =1\) in a synthetic model with 2 or more layers and for \(\tau =0.1\) in a model with 7 (or more) layers [29, Fig. C.6]. When training moderate size SBNs on real data, it performs worse than ST [29, Fig. 4].

ST Gumbel-Softmax. Addressing the issue that relaxed variables deviate from binary samples on the forward pass, Jang et al. [13] proposed the following empirical modification: the ST Gumbel-Softmax estimator keeps the relaxed sample for the gradient but uses the exact Bernoulli sample on the forward pass:

$$\begin{aligned} z&\sim \text {Logistic},\end{aligned}$$
(13a)
$$\begin{aligned} \tilde{x}&= \sigma _\tau (\eta - z),\end{aligned}$$
(13b)
$$\begin{aligned} x&= [\![\eta - z \ge 0 ]\!],\end{aligned}$$
(13c)
$$\begin{aligned} \hat{g}_{\text {st-gs}(\tau )}&= f'(x) \frac{\partial \tilde{x}}{\partial \eta }. \end{aligned}$$
(13d)

Note that x is now distributed as \(\text {Bernoulli}(p)\) with \(p =\sigma (\eta )\), so the forward pass is exact. We show the following asymptotic properties.
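In code, the modification is a one-line change relative to the GS sketch above: the hard sample enters \(f'\) while the relaxed sample only supplies the Jacobian. A minimal sketch (the toy loss derivative and all names are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
eta, tau, n = 0.3, 0.5, 1_000_000
fp = lambda x: 3 * x**2 + 1           # derivative of a toy loss (assumption)

z = rng.logistic(size=n)              # (13a)
s = sigmoid((eta - z) / tau)          # (13b) relaxed sample, kept for the Jacobian
x = (eta - z >= 0).astype(float)      # (13c) exact Bernoulli(sigmoid(eta)) sample
g = fp(x) * s * (1 - s) / tau         # (13d)
print(g.mean(), g.var())              # biased for tau > 0; variance grows as 1/tau
```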

Proposition 5

The ST Gumbel-Softmax estimator [13] is asymptotically unbiased for quadratic functions, and its variance grows as \(O(1/\tau )\) for \(\tau \rightarrow 0\).

Proof in Appendix A.

To summarize, ST-GS is more expensive than ST, as it involves sampling from the logistic distribution (and keeping the samples), and it is biased for \(\tau >0\). It becomes unbiased for quadratic functions as \(\tau \rightarrow 0\), which would be an improvement over ST, but the variance grows as \(\frac{1}{\tau }\).

4 BayesBiNN

Meng et al. [18], motivated by the need to reduce the variance of reinforce, apply the GS estimator. However, in their large-scale experiments they use the temperature \(\tau = 10^{-10}\). According to the previous section, the variance of the GS estimator should then explode, as it grows as \(O(\frac{1}{\tau })\). This is practically prohibitive: learning would require an extremely small learning rate, a very long training time and high numerical accuracy. Nevertheless, good experimental results are demonstrated [18]. We identify a hidden implementation issue which completely changes the gradient estimator and enables learning.

First, we explain the issue. Meng et al. [18] model stochastic binary weights as \(w\sim \text {Bin}(\mu )\) and express the GS estimator as follows.

Proposition 6

(Meng et al. [18] Lemma 1). Let \(w\sim \text {Bin}(\mu )\) and let \(f:\{-1,1\}\rightarrow \mathbb {R}\) be a loss function. Using parametrization \(\mu = \tanh (\lambda )\), \(\lambda \in \mathbb {R}\), GS estimator of gradient \(\frac{\mathrm {d}{\mathbb {E}}_w[f]}{\mathrm {d}\mu }\) can be expressed as

$$\begin{aligned} \delta&\sim \frac{1}{2}\mathrm{Logistic},\end{aligned}$$
(14a)
$$\begin{aligned} \tilde{w}&= \tanh _{\tau }(\lambda - \delta ) \equiv \tanh (\frac{\lambda - \delta }{\tau }),\end{aligned}$$
(14b)
$$\begin{aligned} J&= \frac{1 - \tilde{w}^2}{\tau (1 - \mu ^2)},\end{aligned}$$
(14c)
$$\begin{aligned} \hat{g}&= J f'(\tilde{w}), \end{aligned}$$
(14d)

which we verify in Appendix B. However, the actual implementation of the scaling factor J used in the experiments [18], according to the published code (footnote 2), introduces a technical \(\epsilon =10^{-10}\) as follows:

$$\begin{aligned} J := \frac{1-\tilde{w}^2 + \epsilon }{\tau (1-\mu ^2 + \epsilon )}. \end{aligned}$$
(15)

It turns out this changes the nature of the gradient estimator and of the learning algorithm. The BayesBiNN algorithm [18, Table 1, middle] performs the update:

$$\begin{aligned} \lambda&:= (1-\alpha ) \lambda - \alpha s f'(\tilde{w}), \end{aligned}$$
(16)

where \(s = N J\), N is the number of training samples and \(\alpha \) is the learning rate.

Proposition 7

With the setting of the hyper-parameters \(\tau = O(10^{-10})\) [18, Table 7] and \(\epsilon = 10^{-10}\) (authors' implementation) in the large-scale experiments (MNIST, CIFAR-10, CIFAR-100), the BayesBiNN algorithm is practically equivalent to the following deterministic algorithm:

$$\begin{aligned} w&:= \text {sign}(\bar{\lambda });\end{aligned}$$
(17a)
$$\begin{aligned} \bar{\lambda }&:= (1-\alpha ) \bar{\lambda }- \alpha f'(w). \end{aligned}$$
(17b)

In particular, it does not depend on the values of \(\tau \) and N.

Proof in Appendix B. Experimentally, we have verified, using the authors' implementation, that the parameters \(\lambda \) in (16) indeed grow to the order \(10^{10}\) during the first iterations, as predicted by our calculations in the proof.
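This blow-up mechanism can be reproduced in isolation. The following toy NumPy sketch iterates the update (16) with the implemented J of (15) for a handful of weights and a synthetic per-weight gradient (the function f_prime, the initialization and all names are our own assumptions, not the authors' code); within a few iterations \(|\lambda |\) reaches the order \(10^{10}\) and \(\tilde{w}\) is exactly binary, i.e., the sampling noise has no effect:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, eps, N, alpha = 1e-10, 1e-10, 60_000, 1e-4   # hyper-parameter settings as in [18]

def f_prime(w):                  # synthetic per-weight loss gradient (assumption)
    return w - 0.3

lam = 1e-2 * rng.normal(size=4)  # small random initialization
for it in range(5):
    delta = 0.5 * rng.logistic(size=lam.shape)              # (14a)
    w_tilde = np.tanh((lam - delta) / tau)                  # (14b): saturates to +-1
    mu = np.tanh(lam)
    J = (1 - w_tilde**2 + eps) / (tau * (1 - mu**2 + eps))  # (15)
    lam = (1 - alpha) * lam - alpha * N * J * f_prime(w_tilde)   # (16)
    print(f"iter {it}: max|lam| = {np.abs(lam).max():.1e}, "
          f"w_tilde exactly binary: {np.all(np.abs(w_tilde) == 1.0)}")
```

Once \(1-\mu ^2\) and \(1-\tilde{w}^2\) underflow to zero in floating point, J collapses to \(\epsilon /(\tau \epsilon ) = 1/\tau \); rescaling \(\bar{\lambda } = \lambda \tau / N\) then turns (16) into exactly the deterministic update (17b), consistent with Proposition 7.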

Notice that the step made in (17b) consists of a decay term \(-\alpha \bar{\lambda }\) and the gradient descent term \(- \alpha f'(w)\), where the gradient \(f'(w)\) is a straight-through estimate for the deterministic forward pass \(w = \text {sign}(\bar{\lambda })\). Therefore, the deterministic ST is effectively used. The decay term is the only remaining difference to the deterministic STE algorithm [18, Table 1, left], to which the method is contrasted. From the point of view of our study, we should remark that the deterministic ST estimator used in effect indeed decreases the variance (down to zero), but it increases the bias compared to the baseline stochastic ST [28].

The issue also has downstream consequences for the intended Bayesian learning. The claim of Proposition 7 that the method does not depend on \(\tau \) and N is perhaps somewhat unexpected, but it does make sense. The initial BayesBiNN algorithm of course depends on \(\tau \) and N. However, due to the issue with the implementation of the Gumbel-Softmax estimator, for a sufficiently small value of \(\tau \) it falls into a regime which is significantly different from the initial Bayesian learning rule and is instead more accurately described by (17). In this regime, the result it produces does not depend on the particular values of \(\tau \) and N. While we do not know what problem it is solving in the end, it is certainly not solving the intended variational Bayesian learning problem, because the variational Bayesian learning problem and its solution depend on N in a critical way. The algorithm (17) indeed does not solve any variational problem, as there is no variational distribution involved (nothing is sampled). Yet, the decay term \(-\alpha \lambda \) stays effective: if the data gradient becomes small, the decay term implements some small “forgetting” of the learned information and may be responsible for the improved generalization observed in the experiments [18].

5 FouST

Pervez et al. [22] introduced several methods to improve ST estimators using the Fourier analysis of Boolean functions [21] and Taylor series. The proposed methods are guided by this analysis but lack formal guarantees. We study the effect of the proposed improvements analytically.

One issue with the experimental evaluation [22] is that the baseline ST estimator [22, Eq. 7] is misspecified: it is adopted from works considering \(\{0,1\}\) Bernoulli variables without correcting for the \(\{-1,1\}\) case as in (7), thus differing by a coefficient of 2. The reason for this misspecification is that ST is known rather as a folklore, vaguely defined method (see [28]). While in learning with a simple expected loss this coefficient can be compensated by a tuned learning rate, it can lead to more serious issues, in particular in VAEs with Bernoulli latents and in deep SBNs. The VAE training objective [14] has a data evidence part, where a binary gradient estimator is required, and a prior KL divergence part, which is typically computed analytically and differentiated exactly. Rescaling the gradient of the evidence part only introduces a bias which cannot be compensated by tuning the learning rate: it is equivalent to optimizing the objective with the evidence part rescaled. In [28, Fig. 2] we show that this effect is significant. In the remainder of the section we will assume that the correct ST estimator (7) is used as the starting point.

5.1 Lowering Bias by Importance Sampling

The method [22, Sec. 4.1] “Lowering Bias by Importance Sampling”, as noted by the authors, recovers the DARN gradient estimator [10, Appendix A], which was derived by applying a (biased) control variate in the reinforce method. Transformed to the encoding with \(\pm 1\) variables, it reads

$$\begin{aligned} \hat{g}_{\text {darn}}= f'(x)/p(x). \end{aligned}$$
(18)

By design [10], this method is unbiased for quadratic functions, which is straightforward to verify by inspecting its expectation

$$\begin{aligned} {\mathbb {E}}[\hat{g}_{\text {darn}}] = f'(1) + f'(-1). \end{aligned}$$
(19)

While this is in general an improvement over ST (we may expect functions close to linear to have a lower bias), it is not difficult to construct an example in which it increases the bias compared to ST.

Example 1

The method [22, Sec. 4.1] “Lowering Bias by Importance Sampling”, also denoted as Importance Reweighing (IR), can increase bias.

Let \(p\in [0,1]\) and \(x \sim \text {Bin}(p)\). Let \(f(x) = |x + a|\) with \(0< a < 1\). The derivative of \({\mathbb {E}}[f(x)]\) in p is

$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}p} ((1-p) f(-1) + p f(1)) = f(1) - f(-1) = (1+a) - (1-a) = 2a. \end{aligned}$$
(20)

The expectation of \(\hat{g}_{\text {st}}\) is given by

$$\begin{aligned} (1-p) 2 f'(-1) + p 2 f'(1) = 2 (2p-1). \end{aligned}$$
(21)

The expectation of \(\hat{g}_{\text {darn}}\) is given by

$$\begin{aligned} f'(-1) + f'(1) = 0. \end{aligned}$$
(22)

The bias of DARN is 2|a| while the bias of ST is \(2|a+1-2p|\). Therefore, whenever \(p>1/2\) and \(a > p - 1/2\), the bias of the DARN estimator is higher. In particular, for \(a=0.9\) and \(p=0.95\) the bias of the ST estimator equals 0 while the bias of the DARN estimator equals 1.8.
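The arithmetic of Example 1 can be checked in a few lines (a minimal sketch; the names are ours):

```python
import numpy as np

a, p = 0.9, 0.95
f  = lambda x: np.abs(x + a)
fp = lambda x: np.sign(x + a)    # derivative of f where it exists

true_grad = f(1) - f(-1)                          # Eq. (20): 2a = 1.8
st_mean   = (1 - p) * 2 * fp(-1) + p * 2 * fp(1)  # Eq. (21): 2(2p-1) = 1.8
darn_mean = fp(-1) + fp(1)                        # Eq. (22): 0
print(abs(st_mean - true_grad), abs(darn_mean - true_grad))  # ST bias 0.0, DARN bias 1.8
```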

Furthermore, we can straightforwardly express its variance.

Proposition 8

The variance of \(\hat{g}_{\text {darn}}\) is expressed as

$$\begin{aligned} \mathbb {V}_x [\hat{g}_{\text {darn}}] = \frac{(f'(1) - p (f'(1) + f'(-1)))^2}{p(1-p)}. \end{aligned}$$
(23)

It has asymptotes \(O(\frac{f'(-1)^2}{1-p})\) for \(p\rightarrow 1\) and \(O(\frac{f'(1)^2}{p})\) for \(p\rightarrow 0\).

The asymptotes indicate that the variance can grow unboundedly for units approaching a deterministic mode. If applied in a deep network with L layers, L expressions (18) are multiplied and the variance can grow accordingly. Interestingly though, if the probability p is defined using the sigmoid function as \(p=\sigma (\eta )\), then the gradient in \(\eta \) is additionally multiplied by the Jacobian \(\sigma '(\eta ) = p (1-p)\), and the variance of the gradient in \(\eta \) becomes bounded. Moreover, a numerically stable implementation can simplify \(p(1-p)/p(x)\) for both outcomes of x. We conjecture that this estimator can be particularly useful with this parametrization of the probability (which is commonly used in VAEs and SBNs).
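A sketch of this simplification (our naming; a hypothetical helper, not code from [10] or [22]): for \(x=1\) the factor \(p(1-p)/p(x)\) reduces to \(1-p\), and for \(x=-1\) to p, so the estimate of the gradient in \(\eta \) stays bounded:

```python
import numpy as np

def darn_grad_eta(f_prime, eta, rng, size=1):
    """DARN estimate of d E[f(x)] / d eta for x ~ Bin(sigmoid(eta)) (sketch).

    The Jacobian p(1-p) cancels against 1/p(x):
    the estimate is f'(x)*(1-p) for x = +1 and f'(x)*p for x = -1."""
    p = 1.0 / (1.0 + np.exp(-eta))
    x = np.where(rng.random(size) < p, 1.0, -1.0)
    return f_prime(x) * np.where(x > 0, 1.0 - p, p)

rng = np.random.default_rng(0)
g = darn_grad_eta(lambda x: np.sign(x + 0.9), eta=2.0, rng=rng, size=100_000)
print(g.mean(), g.var())   # bounded even though p is close to 1
```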

Experimental evidence [11, Fig. 2.a], where the DARN estimator is denoted as “\(\frac{1}{2}\)”, shows that the plain ST performs similarly on the structured output prediction problem. However, [11, Fig. 3.a] gives stronger evidence in favor of DARN for VAEs. In Fig. 3 we show an experiment on the MNIST VAE problem, reproducing the experiment of [11, 22] (up to data binarization and implementation details). The exact specification is given in [28, Appendix D.1]. It is seen that DARN improves the training performance but needs earlier stopping and/or more regularization. Interestingly, with a correction of the accumulated bias using the unbiased ARM [35] method with 10 samples, ST leads to better final training and test performance.

Fig. 3.

Experimental comparison of DARN and ST estimators on MNIST VAE. The plots show training and test loss (negative ELBO) during training for different learning rates. After 5000 epochs, an unbiased ARM-10 estimator is applied in order to measure (and correct) the accumulated bias. At the smaller learning rates, where DARN does not diverge, it clearly has a much smaller accumulated bias but manages to overfit significantly.

5.2 Reducing Variance via the Fourier Noise Operator

The Fourier noise operator [22, Sec. 2] is defined as follows. For \(\rho \in [0,1]\), let \(x'\sim N_\rho (x)\) denote that \(x'\) is set equal to x with probability \(\rho \) and chosen as an independent sample from \(\text {Bin}(p)\) with probability \(1-\rho \). The Fourier noise operator smooths the loss function and is defined as \(T_\rho [f](x) = {\mathbb {E}}_{x'\sim N_\rho (x)}[f(x')]\). When applied to f before taking the gradient, it can indeed reduce both bias and variance, ultimately down to zero when \(\rho =0\). Indeed, in this case \(x'\) is independent of x and \(T_\rho [f](x) = {\mathbb {E}}[f(x)]\), which is a constant function of x. However, the exact expectation in \(x'\) is intractable. The computational method proposed in [22, Sec. 4.2] approximates the gradient of this expectation using S samples \(x^{(s)}\sim N_\rho (x)\) as

$$\begin{aligned} \hat{g}_{\rho } = \frac{1}{S} \sum \nolimits _{s} \hat{g} (x^{(s)}), \end{aligned}$$
(24)

where \(\hat{g}\) is the base ST or DARN estimator. We show the following.

Proposition 9

The method [22, Sec. 4.2] “Reducing Variance via the Fourier Noise Operator” does not reduce the bias (unlike \(T_\rho \)) and increases the variance in comparison to the trivial baseline that averages independent samples.

Proof in Appendix C.

This result is in sharp contradiction with the experiments [22, Figure 4], where independent samples perform worse than correlated ones. We do not have a satisfactory explanation for this discrepancy other than the misspecified ST. Since the authors' implementation is not public, it is infeasible to reproduce this experiment in order to verify whether a similar improvement can be observed with the well-specified ST. Lastly, note that, unlike correlated sampling, uncorrelated sampling can be naturally applied with multiple stochastic layers.
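The variance claim of Proposition 9 is easy to probe numerically. A minimal sketch (the toy loss \(f(y)=y^2\) and all settings are our assumptions) compares the variance of the S-sample average (24) with correlated samples \(x^{(s)}\sim N_\rho (x)\) against the same average over independent samples:

```python
import numpy as np

rng = np.random.default_rng(0)
p, rho, S, trials = 0.7, 0.8, 8, 50_000

def draw(size):                       # samples from Bin(p) with values in {-1, +1}
    return np.where(rng.random(size) < p, 1.0, -1.0)

def g_st(y):                          # base ST estimator (7) for f(y) = y^2
    return 2 * 2 * y                  # 2 * f'(y) with f'(y) = 2y

x = draw((trials, 1))                                   # the anchor sample
keep = rng.random((trials, S)) < rho
corr = np.where(keep, x, draw((trials, S)))             # x^(s) ~ N_rho(x), Eq. (24)
indep = draw((trials, S))                               # trivial i.i.d. baseline

print(g_st(corr).mean(axis=1).var())    # larger: correlated samples share noise
print(g_st(indep).mean(axis=1).var())   # smaller
```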

5.3 Lowering Bias by Discounting Taylor Coefficients

For the technique [22, Sec. 4.3.1] “Lowering Bias by Discounting Taylor Coefficients” we present an alternative view, not requiring a Taylor series expansion of f, thus simplifying the construction. Following [22, Sec. 4.3.1], we assume that importance reweighing has been applied. Since the technique samples \(f'\) at non-binary points, we refer to it as a relaxed DARN estimator. It can be defined as

$$\begin{aligned} \tilde{g}_{\text {darn}}(x,u) = \frac{f'(x u)}{p(x)}, \text {where } \ u\sim \mathcal {U}[0,1]. \end{aligned}$$
(25)

In the total expectation, when we draw x and u multiple times, the gradient estimates are averaged. The expectation over u alone effectively integrates the derivative to obtain:

$$\begin{aligned} {\mathbb {E}}_u \big [\tilde{g}_{\text {darn}}(x,u)] = {\left\{ \begin{array}{ll} \frac{1}{p}\int _{0}^{1} f'(u) \mathrm {d}u = \frac{1}{p}(f(1) - f(0)), &{} \text {if } \ \ x = 1,\\ \frac{1}{1-p}\int _{0}^{1} f'(-u) \mathrm {d}u = \frac{1}{1-p}(f(0) - f(-1)), &{} \text {if } \ x = -1.\\ \end{array}\right. } \end{aligned}$$
(26)

In the expectation over x we therefore obtain

$$\begin{aligned} {\mathbb {E}}_{x,u} [\tilde{g}_{\text {darn}}(x,u)] = f(1) - f(-1), \end{aligned}$$
(27)

which is the correct derivative. One issue, discussed in [22], is that the variance increases (as there is more noise in the system). However, a major issue, similar to that of the GS estimator in Sect. 3, reoccurs here: all related expectations become biased. In particular, (25) becomes biased in the presence of other variables. Pervez et al. [22, Sec. 4.3.1] propose to use \(u\sim \mathcal {U}[a, 1]\) with \(a>0\), corresponding to shorter integration intervals around the \(\pm 1\) states, in order to find an optimal tradeoff.
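A quick Monte Carlo check of (27), under our own toy choice of a smooth loss (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.6, 1_000_000
f  = lambda y: np.exp(y)     # toy smooth loss (assumption)
fp = lambda y: np.exp(y)

x = np.where(rng.random(n) < p, 1.0, -1.0)    # x ~ Bin(p)
u = rng.random(n)                             # u ~ U[0, 1]
g = fp(x * u) / np.where(x > 0, p, 1 - p)     # relaxed DARN, Eq. (25)
print(g.mean(), f(1) - f(-1))                 # both approximately e - 1/e
```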

5.4 Lowering Bias by Representation Rescaling

Consider the estimator \(\hat{g}\) of the gradient of the function \({\mathbb {E}}_x[f(x)]\), where \(x \sim \text {Bin}(p)\). Representation rescaling is defined in [22, Algorithm 1] as drawing \(\tilde{x} \sim \frac{1}{\tau } \text {Bin}(p)\) instead of x and then using the FouST estimator based on the derivative \(f'(\tilde{x})\). It is claimed that using a scaled representation can decrease the bias of the gradient estimate. However, the following issue occurs.

Proposition 10

The method [22, Sec. 4.3.2] “Lowering Bias by Representation Rescaling” compares biases of gradient estimators of different functions.

Proof

Sampling \(\tilde{x}\) can be equivalently defined as \(\tilde{x} = x/\tau \). Bypassing the analysis of Taylor coefficients [22], it is easy to see that for a smooth function f, as \(\tau \rightarrow \infty \), \(f(x/\tau )\) approaches a linear function of x and therefore the bias of the ST estimator of \({\mathbb {E}}_x[f(x/\tau )]\) approaches zero. However, clearly \({\mathbb {E}}_x[f(x/\tau )]\) is a different function from \({\mathbb {E}}_x[f(x)]\) which we wish to optimize.    \(\square \)
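This distinction can be made concrete (a minimal sketch in closed form; the toy loss is our assumption): as \(\tau \) grows, the relative bias of ST for the rescaled objective \({\mathbb {E}}_x[f(x/\tau )]\) indeed vanishes, but the objective itself drifts towards the constant f(0):

```python
import numpy as np

p = 0.7
for tau in (1.0, 4.0, 16.0):
    g  = lambda x: np.exp(x / tau)            # rescaled loss f(x / tau), with f = exp
    gp = lambda x: np.exp(x / tau) / tau      # its derivative in x
    true = g(1) - g(-1)                       # exact gradient in p
    st   = (1 - p) * 2 * gp(-1) + p * 2 * gp(1)   # mean of ST (7) for +-1 variables
    obj  = (1 - p) * g(-1) + p * g(1)             # the objective actually optimized
    print(tau, abs(st - true) / abs(true), obj)   # relative bias -> 0, objective -> f(0) = 1
```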

We explain why this method nevertheless has an effect. Choosing and fixing the scaling hyper-parameter \(\tau \) is equivalent to starting from a different initial point, where the (initially random) weights are scaled by \(1/\tau \). At this initial point, the network is closer to the linear regime, where the ST estimator is more accurate and the vanishing gradient issue is possibly mitigated. Thus the method can have a positive effect on the learning, as observed in [22, Appendix Table 3].

6 Conclusion

We theoretically analyzed properties of several methods for estimation of binary gradients and gained interesting new insights.

  • For the GS and ST-GS estimators we proposed a simplified presentation for the binary case and explained the detrimental effects of low and high temperatures. We showed that the bias of the ST-GS estimator approaches that of DARN, connecting these two techniques.

  • For BayesBiNN we identified a hidden issue that completely changes the behavior of the method: from the intended variational Bayesian learning with the Gumbel-Softmax estimator, which is theoretically impossible due to the temperature \(\tau =10^{-10}\) used, to non-Bayesian learning with the deterministic ST estimator and latent weight decay. As this learning method shows improved experimental results, it becomes an open problem to clearly understand and advance the mechanism which facilitates this.

  • In our analysis of the techniques comprising the FouST estimator, we provided additional insights and showed that some of these techniques are not well justified. It remains open whether they are nevertheless efficient in practice in some cases, for other, unknown reasons not taken into account in this analysis.

Overall, we believe our analysis clarifies the surveyed methods and uncovers several issues which limit their applicability in practice. It provides tools and clears the ground for future research that may propose new improvements and would need to compare with existing methods both theoretically and experimentally. We hope that this study will additionally motivate such research.