
1 Introduction

The decision tree model [8], perhaps due to its simplicity and fundamental nature, has been extensively studied over decades, yet it remains a fascinating source of outstanding open questions. In the first part of this paper we focus on decision trees for Boolean functions, i.e., functions of the form \(f : \{0,1\}^n \rightarrow \{0, 1\}.\) In a later section, we extend our results to decision trees over any finite field, i.e., to functions of the form \(\mathbb {F}_q^n \rightarrow \{0, 1\}.\) A deterministic decision tree \(D_f\) for \(f\) takes \(x = (x_1, \ldots , x_n)\) as an input and determines the value of \(f(x_1, \ldots , x_n)\) using queries of the form “\(\text {is } x_i = 1?\)”. Let \(C (D_f, x)\) denote the cost of the computation, i.e., the number of queries made by \(D_f\) on input \(x.\) The deterministic decision tree complexity of \(f\) is defined as \(D(f) = \mathop {\min }_{D_f} \max _x C (D_f, x).\)

Variants of the decision tree model are fundamental for several reasons, including their connection to other models such as communication complexity, their usefulness in analyzing more complicated models such as circuits, their mathematical elegance and richness, and finally the notoriety of some simple yet fascinating open questions about them, such as the Evasiveness Conjecture [3, 14, 15, 19, 22], that have caught the imagination of generations of researchers over decades. In this paper we study a variant of decision trees called the parity decision tree (PDT) and its extension over finite fields, which we call the linear decision tree (LDT).

Motivation for Studying PDTs and LDTs

A parity decision tree may query “\(\text {is } \sum _{i \in S} x_i \equiv 1\pmod 2?\)” for an arbitrary subset \(S \subseteq [n ]= \{1, 2, \ldots , n\}.\) We call such queries parity queries. For a PDT \(P_f\) for \(f,\) let \(C(P_f,x)\) denote the number of parity queries made by \(P_f\) on input \(x.\) The parity decision tree complexity of \(f\) is \( D^\oplus (f) = \mathop {\min }_{P_f} \max _x C (P_f, x). \) Note that \(D^\oplus (f) \le D(f)\) as “is \(x_i = 1?\)” can be treated as a parity query.
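To make the definition concrete, here is a minimal Python sketch (not from the original paper) of a single parity query, with 0-indexed coordinates; an ordinary query is the special case \(S = \{i\}.\)

```python
def parity_query(x, S):
    """Answer 'is sum_{i in S} x_i odd?' for x in {0,1}^n."""
    return sum(x[i] for i in S) % 2

x = [1, 0, 1, 1]
print(parity_query(x, {0, 2}))  # x_1 + x_3 = 1 + 1 = 0 (mod 2)
print(parity_query(x, {0}))     # 1: the ordinary query 'is x_1 = 1?'
```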

PDTs were introduced by Kushilevitz and Mansour [17] in the context of learning Boolean functions by estimating their Fourier coefficients. Several other models, such as circuits and branching programs, have also been analysed in the past after augmenting their power with counting operations.

In spite of being a combinatorially rich and beautiful model, the PDT somehow remained dormant until recently, when it was brought back into the light in an entirely different context, namely the communication complexity of XOR functions [23, 31]. Shi and Zhang [31] and Montanaro and Osborne [23] observed that the deterministic communication complexity \(CC(f^{\oplus })\) of computing \(f(x \oplus y)\), when \(x\) and \(y\) are distributed between the two parties, is upper bounded by \(D^\oplus (f)\). The importance for communication complexity comes from the conjecture [23, 31] that for some positive constant \(c\), every Boolean function \(f\) satisfies \(D^\oplus (f) = O((\log ||\widehat{f}||_0)^c),\) where \(||\widehat{f}||_0\) is the sparsity (number of non-zero Fourier coefficients) of \(f.\) Settling this conjecture in the affirmative would confirm the famous Log-rank Conjecture [24] in the important special case of XOR functions. Recently Tsang et al. [36] confirmed it for functions of constant degree over \(\mathbb {F}_2\), and Kulkarni and Santha [18] confirmed it for \(AC^0\) functions.

Very recently, Bhrushundi, Chakraborty, and Kulkarni [4] connected parity decision trees to property testing of linear and quadratic functions. Their approach, for instance, can potentially be used to solve a long-standing open question of closing the gap for \(k\)-linearity by analysing the randomized PDT complexity of the function \(E_k\) that evaluates to \(1\) iff the number of \(1\)s in the input is exactly \(k.\) Recently PDTs have been analysed further in several papers, including [18, 32, 34, 36], with more to come.

Similar to PDTs, LDTs are closely related to the Fourier spectrum of functions over \(\mathbb {Z}_p.\) In a recent paper, Shpilka, Tal, and Volk [32] derive various structural results on the Fourier spectrum by analysing LDTs. Given the evidence of an abundance of connections to other models and to mathematics, and given the rich combinatorial structure of PDTs and LDTs, we believe that they deserve a systematic and independent study at this point. Our paper is a step in this direction.

Motivation for Studying Influence Lower Bounds

Proving lower bounds on the influence of Boolean functions has a long history in Theoretical Computer Science. It is nicely summarized in the paper [29]; we restate a part of it here for illustration. Influence lower bounds have been a crucial part of several fundamental results, such as the threshold phenomenon, lower bounds on the randomized query complexity of graph properties, quantum and classical equivalence, etc. Ben-Or and Linial [6], in their 1985 paper on collective coin flipping, observed that the maximum influence satisfies \({{\mathrm{Inf}}}_{max}(f) \ge 1/n\) for any balanced function and conjectured a \(\varTheta (\log n / n)\) bound. The seminal paper by Kahn, Kalai, and Linial [16] confirmed the conjecture via an application of the Hypercontractive Inequality. This result was subsequently generalized by Talagrand [35] in order to show sharp threshold behaviour for monotone functions.

In their celebrated paper Every decision tree has an influential variable, O’Donnell, Saks, Schramm, and Servedio [29] showed a crucial inequality lower bounding the maximum influence: \({{\mathrm{Inf}}}_{max}(f) \ge {{\mathrm{Var}}}(f) / \varDelta (f),\) where \(\varDelta (f)\) denotes the minimum possible average depth of a decision tree for \(f.\) This inequality found application in lower bounds on the randomized query complexity of monotone graph properties. Homin Lee [20] found a simple inductive proof of the OSSS result. Recently Jain and Zhang [13] found another simple and conceptually different proof via the method of query elimination, which we use here.

Aaronson and Ambainis [1] study a conjecture lower bounding the maximum influence of real-valued polynomials in terms of their degree. This conjecture, if true, would imply a polynomial equivalence between bounded-error quantum and classical query complexity. These previous results seem to indicate the importance of lower bounds on influence in terms of several complexity measures. In this paper, we present such new lower bounds in terms of PDT and LDT complexity.

Our Results

Let \(D_\epsilon (f)\) and \(D^\oplus _\epsilon (f)\) denote the minimum depth of a DT and a PDT (resp.) computing \(f\) correctly on at least \(1 - \epsilon \) fraction of the inputs.

Theorem 1

For any Boolean function \(f\) and any \(\epsilon \ge 0:\)

$$ {{\mathrm{Inf}}}_{\max }(f) \ge \frac{{{\mathrm{Var}}}(f) - \epsilon }{D^\oplus _\epsilon (f)}. $$

Corollary 1

For any Boolean function \(f\) and any \(\epsilon > 0:\)

$$ D_\epsilon (f) \le \frac{1}{\epsilon ^2} \cdot D^\oplus (f) \cdot {{\mathrm{Inf}}}(f). $$

Corollary 2

If \(f\) is computable by a polynomial size constant depth circuit, i.e., \(f \in AC^0,\) then:

$$ D_\epsilon (f) = \widetilde{O}_\epsilon (D^\oplus (f)). $$

To prove Theorem 1 we use an adaptation of the query elimination method of Jain and Zhang. Our main observation is that, assuming the uniform distribution on the inputs, one can eliminate seemingly powerful parity queries at the expense of \({{\mathrm{Inf}}}_{max}(f)\) error per elimination. Corollary 1 is obtained by analysing the ‘query the most influential variable’ strategy using our new bound. We extend Theorem 1 to LDTs over arbitrary finite fields (see Sect. 4). Corollary 1 can also be extended with similar techniques; we omit the simple proof.

Theorem 2

Let \(q\) be a prime power. For any \(f:{\mathbb {F}}_q^n\rightarrow \{0, 1\}\) and any \(\epsilon \ge 0:\)

$$ {{\mathrm{Inf}}}_{\max }(f) \ge \frac{1}{q-1}\cdot \frac{{{\mathrm{Var}}}(f) - \epsilon }{D^{\oplus _q}_\epsilon (f)}. $$

Further, we explore the power of PDTs for monotone functions and show:

Theorem 3

For any monotone Boolean function \(f\) and any \(\epsilon > 0:\)

$$ D_\epsilon (f) \le \frac{3}{\epsilon ^2} \cdot D^\oplus (f)^{3/2}. $$

To prove Theorem 3 we show an upper bound on the \(L_1\) norm of the Fourier spectrum in terms of PDT depth, which in turn gives an upper bound on the sum of the linear Fourier coefficients of monotone functions. We adapt the proof of the analogous bound for ordinary decision trees by O’Donnell and Servedio. Our main observation is that, under the uniform distribution on inputs, their proof can be extended to PDTs as well. Our result naturally raises the following question:

Question 1

Is it true that for every monotone Boolean function \(f\) and for every \(\epsilon > 0\) we have:

$$ D_\epsilon (f) = \widetilde{O}_\epsilon (D^\oplus (f))? $$

It would also be interesting to see whether our results can be strengthened to \(D^\oplus _\epsilon \) rather than just \(D^\oplus \), as zero-error and bounded-error complexities may behave differently.

We believe that our observations, although they might appear simple, are indeed surprising. They make a crucial qualitative point: under the uniform distribution, the method of lower bounding the ordinary (randomized) decision tree complexity by \({{\mathrm{Var}}}(f) / {{\mathrm{Inf}}}_{max}(f)\) works equally well for the seemingly much more powerful PDTs and LDTs. For non-balanced functions the uniform distribution does not seem to be an optimal choice for maximizing \({{\mathrm{Var}}}(f) / {{\mathrm{Inf}}}_{max}(f)\), but for balanced functions it is. Finally, as an application, we exhibit a gap between randomized PDT complexity and approximate \(L_1\) norm, both of which are relevant to the communication complexity of XOR functions.

Organization. Section 2 contains preliminaries. Section 3 contains the proof of Theorem 1. Section 4 contains the proof of Theorem 2. Due to space constraints, the remaining proofs have been moved to the appendix and are omitted from this version.

2 Preliminaries

Fig. 1. A Boolean decision tree.

Randomized Decision Trees

A bounded error randomized decision tree \(R_f\) is a probability distribution over all deterministic decision trees such that for every input, the expected error of the algorithm is bounded by some fixed constant less than \(1/2\) (say \(1/3\)). The cost \(C(R_f, x)\) is the highest possible number of queries made by \(R_f\) on \(x\), and the bounded error randomized decision tree complexity of \(f\) is \( R(f) = \mathop {\min }_{R_f} \max _x C (R_f, x).\) Similarly one can define bounded error randomized PDT complexity of \(f\), denoted by \(R^\oplus (f).\) Using Yao’s min-max principle one may obtain: \(D_{1/3}(f) \le R(f)\) and \(D^\oplus _{1/3}(f) \le R^\oplus (f).\) (Fig. 1)

Variance and Influence

Let \(\mu _p\) denote the \(p\)-biased distribution on the Boolean cube, i.e., each co-ordinate is independently chosen to be \(1\) with probability \(p.\) The variance of a Boolean function is \({{\mathrm{Var}}}(f, p) := 4 \cdot {\Pr }_{x \leftarrow \mu _p} (f(x) = 0) \cdot {\Pr }_{x \leftarrow \mu _p} (f(x) = 1).\) The influence of the \(i^{th}\) variable under \(\mu _p\) is \( {{\mathrm{Inf}}}_i(f, p) := {\Pr }_{x \leftarrow \mu _p} (f(x) \ne f(x \oplus e_i)). \) Let \({{\mathrm{Inf}}}_{max}(f) := \max _i {{{\mathrm{Inf}}}_i(f)}.\) The total influence (a.k.a. average sensitivity) of \(f\) is \( {{\mathrm{Inf}}}(f, p) := \mathop {\sum }_i {{\mathrm{Inf}}}_i(f, p). \) In this paper we focus on the case \(p=1/2\) and drop \(p\) from the notation.
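For small \(n\), these quantities can be computed by brute force. The following minimal Python sketch (ours, not from the paper; uniform distribution, i.e., \(p = 1/2\)) illustrates the definitions on the majority function.

```python
from itertools import product

def variance(f, n):
    """Var(f) = 4 * Pr[f(x) = 0] * Pr[f(x) = 1] under the uniform distribution."""
    values = [f(x) for x in product([0, 1], repeat=n)]
    p1 = sum(values) / len(values)
    return 4 * (1 - p1) * p1

def influence(f, n, i):
    """Inf_i(f) = Pr[f(x) != f(x xor e_i)] under the uniform distribution."""
    flips = 0
    for x in product([0, 1], repeat=n):
        y = list(x)
        y[i] ^= 1                      # flip the i-th coordinate
        flips += f(x) != f(tuple(y))
    return flips / 2 ** n

maj3 = lambda x: int(sum(x) >= 2)      # majority of 3 bits
print(variance(maj3, 3))               # 1.0: MAJ_3 is balanced
print(influence(maj3, 3, 0))           # 0.5: x_1 is pivotal iff x_2 != x_3
```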

Fourier Spectrum, Polynomial Degree, and Sparsity

Let \(f_\pm : \{-1, 1\}^n \rightarrow \{-1, 1\}\) denote the \(\pm 1\) version of \(f,\) represented by the following polynomial with real coefficients: \(f_\pm (z_1,\ldots , z_n) = \mathop {\sum }_{S \subseteq [n]} \widehat{f}(S) \mathop {\prod }_{i \in S} z_i. \) This polynomial is unique and is called the Fourier expansion of \(f.\) The \(\widehat{f}(S)\) are called the Fourier coefficients of \(f.\) The polynomial degree of \(f\) is \(\deg (f) := \max \{|S| \mid \widehat{f}(S) \ne 0\}.\) The sparsity of a Boolean function \(f\) is \( ||\widehat{f}||_0 := | \{ S \mid \widehat{f}(S) \ne 0 \} |. \) We know that \(\deg (f) \le D(f)\), \(\log ||\widehat{f}||_0 \le D^\oplus (f)\), and \(\log ||\widehat{f}||_0 \le \deg (f).\)
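For small \(n\) the coefficients can be read off by brute force via \(\widehat{f}(S) = \mathop {\mathbf {E}}_{z}[f_\pm (z) \prod _{i \in S} z_i].\) A minimal Python sketch of this (ours), using the two-bit XOR, whose \(\pm 1\) version is just \(z_1 z_2\):

```python
from itertools import product

def fourier_coefficient(f_pm, n, S):
    """f_hat(S) = E_z[ f_pm(z) * prod_{i in S} z_i ] over uniform z in {-1,1}^n."""
    total = 0
    for z in product([-1, 1], repeat=n):
        chi = 1
        for i in S:
            chi *= z[i]
        total += f_pm(z) * chi
    return total / 2 ** n

xor2 = lambda z: z[0] * z[1]           # XOR of two bits in +/-1 notation
coeffs = {S: fourier_coefficient(xor2, 2, S)
          for S in [(), (0,), (1,), (0, 1)]}
print(coeffs)                          # only the coefficient on S = {0, 1} is non-zero
nonzero = [S for S, c in coeffs.items() if abs(c) > 1e-9]
print(len(nonzero), max(len(S) for S in nonzero))  # sparsity 1, degree 2
```

Consistent with the inequalities above, \(\log ||\widehat{f}||_0 = 0 \le D^\oplus (\mathrm{XOR}_2) = 1\) for this example.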

Representing Decision Trees

We represent a decision tree \(T\) as \(T = (x_i, T_0, T_1),\) where \(x_i\) denotes the first variable queried by \(T,\) i.e., the variable at the root of \(T:\) if \(x_i = 0\) then \(T_0\) is consulted; if \(x_i = 1\) then \(T_1\) is consulted. A leaf labeled \(1\) is represented as \((1, \emptyset , \emptyset )\) and a leaf labeled \(0\) as \((0, \emptyset , \emptyset ).\) We represent a parity decision tree as \(T = (x_S, T_0, T_1):\) if \(\sum _{i \in S} x_i \equiv 0 \pmod 2\) then \(T_0\) is consulted, else \(T_1\) is consulted. Leaves are represented in the same way.
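This representation translates directly into code. A minimal Python sketch (ours; leaves encoded with None in place of \(\emptyset ,\) coordinates 0-indexed):

```python
LEAF0 = (0, None, None)               # leaf labeled 0, i.e., (0, {}, {})
LEAF1 = (1, None, None)               # leaf labeled 1

def evaluate(T, x):
    """Run the parity decision tree T = (S, T0, T1) on input x."""
    node, T0, T1 = T
    if T0 is None:                    # leaf: node is the output label
        return node
    branch = sum(x[i] for i in node) % 2
    return evaluate(T1 if branch else T0, x)

# A depth-1 PDT for XOR(x_1, x_2): a single parity query on S = {1, 2}.
xor_tree = ({0, 1}, LEAF0, LEAF1)
print(evaluate(xor_tree, [1, 0]), evaluate(xor_tree, [1, 1]))  # 1 0
```

An ordinary decision tree is the special case in which every query set is a singleton.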

The Query Elimination Lemma (Jain and Zhang)

Jain and Zhang prove the following simple yet powerful lemma:

Lemma 1

(Query Elimination Lemma). If \(T = (x_i, T_0, T_1)\) is an ordinary decision tree that computes \(f\) correctly on at least \(1 - \delta \) fraction of the inputs then either \(T_0\) or \(T_1\) computes \(f\) correctly on at least \(1 - \delta - {{\mathrm{Inf}}}_{i}(f)\) fraction of the inputs.

In this paper we observe that the above lemma can be adapted for parity decision trees. This observation is a crucial part of our results.

Overview of the Query Elimination Method

The query elimination method of Jain and Zhang works as follows. Suppose we have a decision tree of depth \(D_\epsilon (f)\) that computes \(f\) correctly on at least \(1 - \epsilon \) fraction of the inputs. We repeatedly apply the Query Elimination Lemma to obtain a decision tree that computes \(f\) correctly on at least \(1 - \epsilon - D_\epsilon (f) \cdot {{\mathrm{Inf}}}_{max}(f)\) fraction of the inputs without making any query at all. Of course, such a (zero-query) decision tree must make error on at least \({{\mathrm{Var}}}(f)\) fraction of the inputs. Hence the error of the zero-query decision tree that we obtained, \(\epsilon + D_\epsilon (f) \cdot {{\mathrm{Inf}}}_{max}(f),\) can be lower bounded by \({{\mathrm{Var}}}(f).\) In other words:

$$ D_\epsilon (f) \ge \frac{{{\mathrm{Var}}}(f) - \epsilon }{{{\mathrm{Inf}}}_{max}(f)}. $$
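For intuition, the bound can be checked by brute force on small functions. The sketch below (ours) takes \(\epsilon = 0\) and reuses the variance and influence helpers sketched above; the depth of \(\mathrm{MAJ}_3\) is \(3\), since majority on three bits is evasive.

```python
# Brute-force sanity check of the bound with eps = 0, reusing the
# variance() and influence() sketches from above.
def inf_max(f, n):
    return max(influence(f, n, i) for i in range(n))

maj3 = lambda x: int(sum(x) >= 2)
depth = 3   # MAJ_3 is evasive: any decision tree must query all 3 bits
print(depth >= variance(maj3, 3) / inf_max(maj3, 3))  # True: 3 >= 2.0
```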

3 Every PDT Has an Influential Variable

In this section we present the proof of Theorem 1. We begin by eliminating ordinary queries in PDTs.

Eliminating Ordinary Queries in PDTs

First we note that Jain and Zhang’s proof of the Query Elimination Lemma generalizes to the case where the \(T_i\) are parity decision trees instead of ordinary ones. In other words, if the first query in a parity decision tree is an ordinary query, then one can remove it at the expense of an \({{\mathrm{Inf}}}_{i}(f)\) increase in the error. We formulate this below.

Lemma 2

If \(T = (x_{\{i\}}, T_0, T_1)\) is a parity decision tree that computes \(f\) correctly on at least \(1 - \delta \) fraction of the inputs then either \(T_0\) with every occurrence of \(x_i\) hard-wired to \(0\) or \(T_1\) with every occurrence of \(x_i\) hard-wired to \(1\) computes \(f\) correctly on at least \(1 - \delta - {{\mathrm{Inf}}}_{i}(f)\) fraction of the inputs.

Eliminating Parity Queries in PDTs

Let \(T\) be a parity decision tree that computes \(f\) correctly on at least \(1 - \delta \) fraction of the inputs. Our idea is to convert the parity query at the root of the tree into an ordinary one and then eliminate it. For a linear transformation \(L\) of the input space, let

$$\begin{aligned} Lf(x) := f(Lx). \end{aligned}$$

We apply the linear transformation \(L\) to the input space \(\mathbb {F}_2^n\) and work with \(Lf\) instead of \(f.\)

Observation 4

\({{\mathrm{Var}}}(f) = {{\mathrm{Var}}}(Lf) \text { and } D^\oplus (f) = D^\oplus (Lf).\)

Rotating the PDT \(T\): Without loss of generality, let us assume that the first parity query in \(T\) is the parity of the first \(k\) bits, i.e., \(x_1 \oplus \ldots \oplus x_k\) (for some \(k\)). Let \(g(x_1, \ldots , x_n) := f(x_1 \oplus \ldots \oplus x_k, x_2, \ldots , x_n).\) Note that \(g = Lf,\) where \(L\) is the following invertible linear transformation of the vector space \(\mathbb {F}_2^n:\) \(L(x_1, \ldots , x_n) := (x_1 \oplus \ldots \oplus x_k, x_2, \ldots , x_n).\) Also note that \(f(x_1, \ldots , x_n) = g(x_1 \oplus \ldots \oplus x_k, x_2, \ldots , x_n).\) Thus by querying \(x_1 \oplus \ldots \oplus x_k\) we learn the value of the ‘first input bit’ of \(g.\) Moreover, the influence of the first variable remains unchanged.

Observation 5

\({{\mathrm{Inf}}}_{1}(g) = {{\mathrm{Inf}}}_1(f).\)

Note however that the influences of the variables \(x_2, \ldots , x_k\) might have changed!

A PDT \(T = (x_{[k]}, T_0, T_1)\) for \(f\) can easily be modified into a PDT \(LT\) for \(Lf = g.\) We call the transformation from \(T\) to \(LT\) the rotation of \(T;\) it is defined recursively as follows:

$$ L (x_S, T_0, T_1) : = (L(x_S), L(T_0), L(T_1)), $$
$$ {\mathsf{(base~case)}}\ \ L(0, \emptyset , \emptyset ) = (0, \emptyset , \emptyset ), $$
$$ {\mathsf{(base~case)}}\ \ L(1, \emptyset , \emptyset ) = (1, \emptyset , \emptyset ) . $$
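In code, the rotation is a node-wise rewriting of query sets. Below is a minimal Python sketch (ours) on the tuple representation from Sect. 2, with 0-indexed coordinates, so the root query \([k]\) is \(\{0, \ldots , k-1\};\) the helper rotate_set is our illustrative name for how a parity over \(S\) of \(Ly\) becomes a parity over a transformed set of \(y.\)

```python
def rotate_set(S, k):
    """Query set S evaluated on L(y) equals this set evaluated on y,
    where L(y) = (y_1 xor ... xor y_k, y_2, ..., y_n)."""
    if 0 in S:                        # the query involves the first coordinate
        return frozenset(S) ^ frozenset(range(1, k))
    return frozenset(S)

def rotate(T, k):
    """L(x_S, T0, T1) = (L(x_S), L(T0), L(T1)); leaves are fixed (base case)."""
    node, T0, T1 = T
    if T0 is None:
        return T
    return (rotate_set(node, k), rotate(T0, k), rotate(T1, k))

# The root query [k] = {0,...,k-1} rotates to {0}: an ordinary query on x_1.
print(rotate_set(frozenset(range(3)), 3))  # frozenset({0})
```

In particular, the root of \(LT\) queries the single variable \(x_1,\) as used below.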

Next we observe that the error is preserved by a rotation.

Observation 6

If \(T\) computes \(f\) correctly on \(1 - \delta \) fraction of the inputs then \(LT\) computes \(g = Lf\) correctly on \(1 - \delta \) fraction of the inputs.

Moreover, the tree \(LT\) has the nice property that the query at its root is not an arbitrary parity query but in fact an ordinary query, namely the variable \(x_1.\) Hence we can use Lemma 2 to remove the first query at the expense of an \({{\mathrm{Inf}}}_1(g) = {{\mathrm{Inf}}}_1(f)\) increase in the error. Thus we conclude:

Proposition 1

If \(T\) computes \(f\) with error \(\delta \) then either \(L T_0\) or \(L T_1\) computes \(Lf\) correctly on at least \(1 - \delta - {{\mathrm{Inf}}}_{max}(f)\) fraction of the inputs.

Rotating the PDT \(LT_i\) back to \(T_i\):

Observation 7

For the particular \(L\) above, \(L^{-1} = L.\)

Suppose that \(LT_i\) computes \(Lf\) correctly on at least \(1 - \delta - {{\mathrm{Inf}}}_{max}(f)\) fraction of the inputs.

Thus we can rewrite Observation 6 as follows:

Observation 8

If \(LT\) computes \(Lf\) correctly on \(1 - \delta \) fraction of the inputs then \(L(LT)\) computes \(f = L(Lf)\) correctly on \(1 - \delta \) fraction of the inputs.

Proof of Theorem 1. Since \(L (LT_i) = T_i\) and since \(LT_i\) computes \(Lf\) correctly on at least \(1 - \delta - {{\mathrm{Inf}}}_{max}(f)\) fraction of the inputs, \(T_i\) computes \(f\) with the same error. Notice that \(T_i\) makes one fewer parity query than \(T\). So we have eliminated one parity query with an increase in error of at most \({{\mathrm{Inf}}}_{max}(f).\) We can now repeat this process, starting from a parity tree \(T\) of depth \(D^\oplus _\epsilon (f)\) that errs on at most \(\epsilon \) fraction of the inputs, to obtain a zero-query parity decision tree that makes at most \(\epsilon + D^\oplus _\epsilon (f) \cdot {{\mathrm{Inf}}}_{max}(f)\) error. The error of any zero-query parity decision tree must be at least \({{\mathrm{Var}}}(f).\) This completes the proof of Theorem 1.   \(\square \)

Remark 1

The OR and AND functions on \(n\) variables can be computed with error probability at most \(1/n\) on every input using \(O(\log n)\) parity queries chosen uniformly at random. Thus our Theorem 1 can be extended (up to a multiplicative poly-logarithmic factor) to decision trees that use AND, OR, and PARITY queries. More generally, one can extend it to the so-called \(1+\) queries (see [10]) involving parities of (say, polynomially many) arbitrary subsets.
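For OR, the random-parity strategy can be sketched as follows (a uniformly random subset has odd intersection with the support of a non-zero \(x\) with probability exactly \(1/2,\) so \(t\) queries err with probability \(2^{-t}\)); this is our illustrative sketch, not the paper's construction.

```python
import random

def or_via_random_parities(x, t):
    """One-sided test for OR(x) using t uniformly random parity queries."""
    n = len(x)
    for _ in range(t):
        S = [i for i in range(n) if random.random() < 0.5]
        if sum(x[i] for i in S) % 2 == 1:
            return 1              # an odd parity certifies x != 0
    return 0                      # never errs on x = 0; errs w.p. 2^-t otherwise

x = [0] * 63 + [1]
print(or_via_random_parities(x, 6))   # 1, except with probability 2^-6
```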

4 Every Linear Decision Tree Has an Influential Variable

Let \(q\) be a prime power and \({\mathbb {F}}_q\) be the finite field with \(q\) elements. In this section we consider computing functions from \({\mathbb {F}}_q^n\) to \(\{0, 1\}\) in the linear decision tree model, denoted \(\oplus _q\)-DT. It is a computation tree in which each internal node \(v\) is labeled by a linear form \(\ell :{\mathbb {F}}_q^n \rightarrow {\mathbb {F}}_q\) and has \(q\) children, with the edges connecting them to \(v\) labeled by the elements of \({\mathbb {F}}_q\). The branching at node \(v\) is based on the evaluation of \(\ell \) on the input vector. Clearly, when \(q=2\) this model becomes the parity decision tree model for computing Boolean functions. We use \(D^{\oplus _q}_\epsilon (f)\) to denote the smallest depth of a \(\oplus _q\)-DT computing \(f:{\mathbb {F}}_q^n\rightarrow \{0, 1\}\) with error \(\epsilon \).

We focus on the uniform distribution over \({\mathbb {F}}_q^n\). For \(f:{\mathbb {F}}_q^n\rightarrow \{0, 1\}\), the variance is defined as before: \({{\mathrm{Var}}}(f)=4 \cdot \mathop {\Pr }(f(x) = 0) \cdot \mathop {\Pr }(f(x) = 1).\) If \(x\) and \(y\) in \({\mathbb {F}}_q^n\) differ only at the \(k\)th position, \(k\in [n]\), we denote this by \(x\sim _k y\). The influence of the \(k^{th}\) variable is \( {{\mathrm{Inf}}}_k(f) := \mathop {\Pr }\nolimits _{x\sim _k y}(f(x) \ne f(y)).\) Our main result is the following analogue of Theorem 1.

Theorem 2, restated. For any function \(f:{\mathbb {F}}_q^n\rightarrow \{0, 1\}\) and any \(\epsilon \ge 0:\)

$$ {{\mathrm{Inf}}}_{\max }(f) \ge \frac{1}{q-1}\cdot \frac{{{\mathrm{Var}}}(f) - \epsilon }{D^{\oplus _q}_\epsilon (f)}. $$

We now prove Theorem 2. We shall adapt the proof of the query elimination lemma to \(\oplus _q\)-DT as follows.

Suppose \(T\) is a \(\oplus _q\)-DT for \(f:{\mathbb {F}}_q^n\rightarrow \{0, 1\}\). Let \(\ell :{\mathbb {F}}_q^n\rightarrow {\mathbb {F}}_q\) be the first query made by \(T\), with \(\ell (x_1, \dots , x_n)=\alpha _1x_1+\alpha _2x_2+\dots +\alpha _nx_n\). As \(\ell \) is not trivial, there exists some \(k\in [n]\) s.t. \(\alpha _k\ne 0\). Fix such a \(k\in [n]\). For \(i\in {\mathbb {F}}_q\), let \(T_i\) be the \(\oplus _q\)-DT to be executed when \(\ell (x)=i\).

For every \(T_i\), \(i\in {\mathbb {F}}_q\), construct a new \(\oplus _q\)-DT \(T_i'\), by replacing every occurrence of \(x_k\) in \(T_i\) with

$$ \frac{1}{\alpha _k}(i-(\alpha _1x_1+\dots +\alpha _{k-1}x_{k-1}+ \alpha _{k+1}x_{k+1}+\dots +\alpha _nx_n)). $$

It is clear that \(T_i'\) and \(T_i\) are related as follows. Let \(a=(a_1, \dots , a_n)\in {\mathbb {F}}_q^n\). Then \(T_i'(a_1, \dots , a_n)=T_i(a_1, \dots , a_{k-1}, b_k, a_{k+1}, \dots , a_n)\), where \(b_k\in {\mathbb {F}}_q\) s.t.

$$ \ell (a_1, \dots , a_{k-1}, b_k, a_{k+1}, \dots , a_n)=i. $$

For \(a=(a_1, \dots , a_n)\in {\mathbb {F}}_q^n\), we use \(a|_k^{\ell , i}\) to denote \((a_1, \dots , a_{k-1}, b_k, a_{k+1}, \dots , a_n)\in {\mathbb {F}}_q^n\) satisfying the above. Then we have \(T_i'(a)=T_i(a|_k^{\ell , i})\).
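As a concrete sketch of this substitution (ours, restricted to a prime field \({\mathbb {F}}_p\) so that Python's built-in modular inverse applies; general prime powers would need full \({\mathbb {F}}_q\) arithmetic):

```python
def substitute(a, alpha, i, k, p):
    """Return a|_k^{ell,i}: agree with a off coordinate k, choosing b_k so that
    ell(x) = sum_j alpha_j * x_j equals i mod p (alpha_k must be non-zero)."""
    rest = sum(alpha[j] * a[j] for j in range(len(a)) if j != k) % p
    b_k = (i - rest) * pow(alpha[k], -1, p) % p   # alpha_k^{-1} in F_p
    return a[:k] + (b_k,) + a[k + 1:]

# ell(x) = 2*x_1 + x_2 over F_5; force ell = 3 by moving coordinate k = 1
# (0-indexed k = 0 below).
print(substitute((1, 4), (2, 1), 3, 0, 5))  # (2, 4), since 2*2 + 4 = 8 = 3 (mod 5)
```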

As \(T\) computes \(f\) with error \(\epsilon \), there exists some \(j\in {\mathbb {F}}_q\) s.t., when restricted to \(\{a\in {\mathbb {F}}_q^n\mid \ell (a)=j\}\), \(T_j\) computes \(f\) with error \(\le \epsilon \). Fix such a \(T_j\), and consider \(T_j'\). We claim that \(T_j'\) computes \(f\) with error no more than \(\epsilon +(q-1){{\mathrm{Inf}}}_k(f)\).

To see this, for \(i\in {\mathbb {F}}_q\), \(i\ne j\), define

$$ A|_k^{\ell , j}(f, i)=\mathop {\Pr }_{a\in {\mathbb {F}}_q^n, \ell (a)=i}(f(a)\ne f(a|_k^{\ell , j})). $$

It is obvious that \(T_j'\) computes \(f\) with error \(\le \epsilon + \frac{1}{q}\sum _{i\in {\mathbb {F}}_q, i\ne j}A|_k^{\ell , j}(f, i)\). Now we verify that \(\frac{1}{q}\sum _{i\in {\mathbb {F}}_q, i\ne j}A|_k^{\ell , j}(f, i)\le (q-1){{\mathrm{Inf}}}_k(f)\). Fix \(a=(a_1, \dots , a_n)\) from \(\{a\in {\mathbb {F}}_q^n\mid \ell (a)=j\}\). Then the contribution of \((a_1, \dots , a_{k-1}, a_{k+1}, \dots , a_n)\) to \(\frac{1}{q}\sum _{i\in {\mathbb {F}}_q, i\ne j}A|_k^{\ell , j}(f, i)\) is \(\frac{1}{q}\cdot \frac{1}{q^{n-1}}\cdot s\), where \(s\in \{0, \dots , q-1\}\) is the number of field elements \(b\) s.t. \(f(a_1, \dots , a_{k-1}, b, a_{k+1}, \dots , a_n)\ne f(a_1, \dots , a_n)\). On the other hand, its contribution to \((q-1)\cdot {{\mathrm{Inf}}}_k(f)\) is \((q-1)\cdot \frac{1}{q^{n-1}}\cdot \frac{s(q-s)}{\binom{q}{2}}\). Finally note that \(\frac{s}{(q-1)q} \le \frac{s(q-s)}{\binom{q}{2}}\) for \(q\ge 2\) and \(s\in \{0, \dots , q-1\}\): since \(\binom{q}{2} = q(q-1)/2\), the inequality is equivalent to \(s \le 2s(q-s)\), which is trivial for \(s=0\) and follows from \(q-s\ge 1\) when \(s\ge 1\).
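The final inequality is pure arithmetic in \(q\) and \(s\) and is easy to confirm numerically, e.g.:

```python
from math import comb

# Check s/((q-1)q) <= s(q-s)/binom(q,2) for small q and all s in {0,...,q-1}.
ok = all(s / ((q - 1) * q) <= s * (q - s) / comb(q, 2)
         for q in range(2, 50) for s in range(q))
print(ok)  # True
```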

As eliminating the first query introduces an extra error of at most \((q-1){{\mathrm{Inf}}}_{\max }(f)\), an argument similar to that in the proof of Theorem 1 gives \(\epsilon +(q-1)D^{\oplus _q}_\epsilon (f)\cdot {{\mathrm{Inf}}}_{\max }(f)\ge {{\mathrm{Var}}}(f)\), therefore proving that

$$ {{\mathrm{Inf}}}_{\max }(f)\ge \frac{1}{q-1}\cdot \frac{{{\mathrm{Var}}}(f)-\epsilon }{D^{\oplus _q}_\epsilon (f)}. $$