
1 Introduction

Digital circuit synthesis [3, 7, 8, 15] from higher-level descriptions to technology-dependent standard cells is one of the core activities in Electronic Design Automation (EDA), well studied in academic research and implemented in powerful commercial tools. It is the hardware analog of optimizing compilation and an indispensable tool in producing efficient chips. Traditionally, the major optimization objectives in synthesis have been area and speed, the latter associated with the depth of the circuit from primary inputs to outputs. In recent decades, power consumption has become a no less important performance measure, for reasons that need not be repeated here [1, 2, 4, 14]. In this work we develop a new synthesis algorithm geared toward reducing the expected number of switchings in the circuit, an important ingredient in its power consumption. This work can be viewed as part of the trend to apply formal technology (abstract reasoning on Boolean functions and automata) outside the traditional scope of verification, namely, handling quantitative properties such as timing and power consumption that were once considered non-functional, and applying optimization/synthesis rather than evaluation/verification with respect to them.

Figure 1 sketches a possible logic synthesis flow. Starting from a multi-level logic specification, the circuit is brought into the form of an And-Inverter Graph (AIG), consisting solely of and and not gates. This representation is then mapped onto a concrete technology of standard cells with physical properties such as size and electrical characteristics. Syntactically, AIGs are composed of 2and gates, but by collapsing together all not-free “cones” we obtain a semantically equivalent function constructed from and gates of unbounded fan-in (arity). Part of the technology-dependent mapping can be viewed as decomposing those ands into networks of 2ands, and this is the problem we address in this paper.

Fig. 1. A logic synthesis design flow: from multilevel logic specification to And-Inverter graphs to standard cells.

Dynamic power consumption of Boolean gates is associated essentially with their switching between 0 and 1. In this work we consider synchronous combinational circuits that process sequences of input vectors. For each input vector, a circuit propagates values from input to output ports until it stabilizes and then reads the next input. The overall number of switchings associated with a pair of inputs is the number of gates whose stable value for one input differs from their value for the next input. For one such pair it is possible to steer the synthesis process and obtain a circuit with significantly fewer switchings than other arbitrary circuits realizing the same function. But of course, any circuit will process during its lifetime a long sequence consisting of diverse consecutive pairs of input vectors, and optimizing synthesis with respect to all of those is a challenging problem.

One natural approach is to define some probability function over input sequences, induced, for example, by a Markov chain that generates them. However, even evaluating the expected number of switchings in a given circuit is intractable for non-trivial probabilistic generators with many input variables. As an alternative, we develop in this paper a switching-aware synthesis procedure that optimizes the circuit relative to a reference sequence assumed to represent a typical input. In essence, the algorithm estimates the expected number of switchings associated with the conjunction of any pair of input variables and then solves an optimal perfect matching problem to decide which variables to pair together as inputs to a 2and gate. The procedure obtains quite a good switching reduction compared to arbitrary realizations of the same function by circuits of the same topology.

We then study the question of optimization with respect to inputs generated by Markov chains of small description size, that is, networks of sparsely-interacting 2-state probabilistic automata. We use such networks to generate the reference (training) sequences and then measure the performance gains on other sequences generated from the same model. We perform experiments on models with varying degrees of variable dependency and other assumptions on the inputs, and obtain significant reductions in switching activity. Finally, we introduce a reduced model of an instruction decoder and evaluate our procedure under probabilistic assumptions concerning the instruction stream.

2 Problem Statement

Our starting point is a Boolean circuit constructed from unbounded fan-in and gates and not gates, and our goal is to replace the and gates by 2and gates, yielding a semantically equivalent circuit C. Once we have a good solution for the and-to-2and problem we can apply it to any and in the AIG and solve the synthesis problem for the whole circuit. From now on we consider a function \(f:(x_1,\ldots ,x_n) \mapsto x_1\wedge \cdots \wedge x_n\) and a target circuit C which is a properly structured directed acyclic graph whose nodes are 2ands of the form \(g:(x_1,x_2)\mapsto x_1\wedge x_2\). We denote the input space \(\mathbb {B}^n\) by X and the state-space of C, that is, the set of possible values in the output ports of all its gates, by \(Y=\mathbb {B}^m\). The synthesized circuit C can be viewed as a memoryless transducer from \(X^*\) to \(Y^*\) such that for every t, y[t] is the stable state of the circuit after processing x[t]. The amount of switching in C relative to input x at time t is

$$S(C,x,t)=\varDelta (y[t-1],y[t])$$

where \(\varDelta \) is the Hamming distance between Boolean vectors. The total amount of switching while reading a sequence \(x\in X^*\) is

$$S(C,x)=\sum _{t=1}^{|x|} S(C,x,t).$$
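These definitions translate directly into a simulation-based count. The following is a minimal Python sketch (our own illustration, not code from the paper), assuming a 2and circuit is encoded as a topologically ordered list of gates over a signal array:

```python
from typing import List, Tuple

def switching_count(gates: List[Tuple[int, int]], xs: List[List[int]]) -> int:
    """Compute S(C, x): total switchings of a 2and circuit on input sequence xs.

    gates[i] = (a, b) means gate i conjoins signals a and b, where signals
    0..n-1 are the primary inputs and signal n+i is the output of gate i
    (gates listed in topological order).
    """
    n = len(xs[0])
    prev, total = None, 0
    for x in xs:
        sig = list(x)
        for a, b in gates:
            sig.append(sig[a] & sig[b])      # stable value of this gate
        state = sig[n:]                      # y[t]: outputs of all gates
        if prev is not None:
            total += sum(u != v for u, v in zip(prev, state))  # Hamming distance
        prev = state
    return total
```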

A circuit C is better than \(C'\) relative to x if \(S(C,x)<S(C',x)\). We want to build circuits which are optimal or reasonable in this sense. A major issue is what to assume about the set of inputs used to evaluate \(S(C,\cdot )\). One can think of two approaches.

  1. Assume some probability function P on \(X^*\), or more precisely a family of probabilities \(P_k:X^k\rightarrow [0,1]\), defined for example via a Markov chain, and then attempt to optimize the expected number of switchings per time step

     $$S(C,P)=\lim _{k\rightarrow \infty } \sum _{x\in X^k} P_k(x)\cdot S(C,x)/k.$$

  2. Use a long reference sequence \(\underline{x}\) and evaluate C according to \(S(C,\underline{x})\).

We will use a mixture of these two approaches. We optimize \(S(C,\underline{x})\) for some training sequence \(\underline{x}\) generated by a Markov chain and then evaluate the synthesized circuit according to the number of switchings that occur while processing other sequences generated from the same chain.

3 Input Pairing for and Gates

The principle underlying switching reduction for and gates is simple. Among all elements of X only a single vector \(\mathbf {1}=(1,\ldots ,1)\) satisfies \(f(x)=1\). Input transitions of the form \(x\rightarrow x'\) such that \(x\not =\mathbf {1}\) and \(x'\not =\mathbf {1}\) will not change the primary output. They can change, however, the values of intermediate gates that realize a conjunction of a subset of the input variables. We call such transitions useless and our goal is to “abort” them as soon as possible.

Fig. 2. For an input transition \({(0,0,0,0,0,0,0,0)} \rightarrow {(0,1,1,1,1,1,1,1)}\), a chain realization can abort all switchings but a tree cannot.

There are numerous realizations of f by 2ands, all using \(n-1\) gates. Among those, one can single out two extreme topologies: the sequential chain, whose depth is \(n-1\), and the balanced tree, with depth \(d=\log n\). Since circuit depth determines propagation delay from inputs to outputs, speed considerations favor balanced trees and we focus on those in this paper. Balanced trees bring some regularity to the problem, allowing us to work recursively on the levels of the tree from 0 (primary inputs) to \(d-1\). Note, however, that chains tend to be more efficient at switching reduction in and gates because they can abort useless switchings earlier, as shown in Fig. 2. We have also implemented a version of our procedure, not reported here, which does not commit a priori to the circuit topology and which can be applied when power consumption is considered much more important than latency.

Table 1. (A) The output transitions of an and-gate as a function of the input transitions; (B) the number of switchings associated with every pair \((u\rightarrow u',v\rightarrow v')\) of input transitions; (C) the net switching reduction: the number of input switchings minus the number of output switchings.

For the fixed balanced tree topology, the synthesis problem reduces to mapping input variables to the circuit input ports. The problem can be phrased recursively as follows. At level i of the tree, \(2^{d-i}\) inputs should be partitioned into pairs to be mapped into \(2^{d-i-1}\) 2and gates. To understand which input signals should be paired together, consider Table 1-(A), which shows which transitions are taken by the output as a function of the transitions taken by the inputs. Table 1-(B) shows the number of output switchings in each case, while Table 1-(C) shows the net switching reduction effect, namely, the number of input switchings minus the number of output switchings. It is intuitively clear that for one consecutive pair of inputs, we should pair together variables taking the respective transitions \(1\rightarrow 0\) and \(0 \rightarrow 1\). Such transitions cancel each other and send to the next level a variable doing \(0\rightarrow 0\), which will not trigger further switching with any other input it is paired with. Figure 3 shows two circuits and their performance differences with respect to a single consecutive pair of input vectors.

Fig. 3. Two pairings for input transition \({(0,1,0,1,0,1,0,1)} \rightarrow {(1,0,1,0,1,0,1,0)}\): (a) a bad pairing with 6 switchings; (b) a good pairing with no switchings.

Let \(R_{jk}(u,u',v,v')\) be the probability that a pair \((x_j,x_k)\) of input variables takes the joint transition \((u\rightarrow u',v\rightarrow v')\). When the inputs are generated by a Markov chain, these probabilities can be derived from the steady state of the chain, which is, however, typically too hard to compute. Given a reference sequence \({\underline{x}}\), we can approximate \(R_{jk}(u,u',v,v')\) by the relative frequency of occurrences of the given joint transition in the sequence. Denoting the number of switchings associated with a pair of transitions \(u\rightarrow u'\) and \(v\rightarrow v'\) (Table 1-(B)) by \(s(u,u',v,v')\) (always 0 or 1), the expected number of switchings in \(x_j \wedge x_k\) is

$$\mu _{jk}=\sum _{u,u',v,v'}R_{jk}(u,u',v,v')\cdot s(u,u',v,v').$$
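Since \(s(u,u',v,v')=1\) exactly when the output of the gate \(x_j\wedge x_k\) changes value, \(\mu _{jk}\) can be estimated directly as the empirical flip frequency of the conjunction along the reference sequence. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def pair_switching_costs(xs: np.ndarray) -> np.ndarray:
    """Estimate mu[j, k] from a training sequence xs of shape (T, n) with
    entries in {0, 1}: the relative frequency with which the output of the
    gate x_j AND x_k switches between consecutive steps."""
    g = xs[:, :, None] & xs[:, None, :]   # g[t, j, k] = x_j[t] & x_k[t]
    flips = g[1:] != g[:-1]               # did the gate output switch at step t?
    return flips.mean(axis=0)             # empirical estimate of mu_jk
```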

Let \(G=(V,E,\mu )\) be a complete graph with n nodes where each edge (j, k) is labeled by \(\mu _{jk}\). For the first level of the tree, the problem of finding an input pairing which is optimal in terms of the expected total number of switchings is equivalent to the optimization problem known as minimal-weight perfect matching [12] for G. Once such an optimal pairing is found for level i, the outputs of the gates at this level serve as inputs for the pairing problem of the next level, as summarized in Algorithm 1. The first polynomial algorithm for the optimal matching problem dates back to [5] and is based on linear programming. The complexity of the algorithm has been improved in [10] from \(O(n^4)\) to \(O(n^3)\). Thus, together with the computation of \(\mu \) from the training sequence, the complexity of our procedure is \(O(n^2\cdot |{\underline{x}}|+n^3)\).

Algorithm 1. Synthesis by level-wise optimal input pairing.
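A compact sketch of the level-greedy procedure, reusing pair_switching_costs above and networkx for the matching step; it assumes the number of signals at each level is even (a power of two for a balanced tree):

```python
import networkx as nx
import numpy as np

def synthesize_tree(xs: np.ndarray) -> list:
    """Level-greedy tree synthesis: pair current-level signals by
    minimum-weight perfect matching on estimated switching costs, then
    let the simulated gate outputs become the signals of the next level."""
    sigs, tree = xs.copy(), []
    while sigs.shape[1] > 1:
        mu = pair_switching_costs(sigs)        # estimated mu_jk for this level
        G = nx.Graph()
        for j in range(sigs.shape[1]):
            for k in range(j + 1, sigs.shape[1]):
                G.add_edge(j, k, weight=float(mu[j, k]))
        pairs = sorted(tuple(sorted(e)) for e in nx.min_weight_matching(G))
        tree.append(pairs)                     # the 2and gates of this level
        sigs = np.stack([sigs[:, j] & sigs[:, k] for j, k in pairs], axis=1)
    return tree  # one list of (j, k) pairs, indexing the previous level, per level
```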

The results of the algorithm may deviate from the optimal expected number of switchings for three reasons. First, it is not based on the real value of \(\mu \) but on its approximation from the training sequence. Second, it works level by level in a greedy fashion and hence, in principle, it is not guaranteed to produce the optimum among all circuits. Finally, it collects statistics only for pairs of variables and ignores more complex dependencies between three or more variables that may influence the outcome. However, as the experimental results show, it constitutes a very effective heuristic. We have implemented the algorithm and evaluated it empirically on purely synthetic examples and then on a realization of an instruction decoder. In the current implementation, since we limit the evaluation to \(n=16\), we find the optimal matching by enumeration.

4 Evaluation: Synthetic Boolean Models

We evaluate our algorithm against different classes of probabilistic 16-dimensional input generators. To define probabilities over \(X^*\) using arbitrary Markov chains we need to handle transition matrices of size at least \(2^n\times 2^n\). For large n even writing down such a matrix is infeasible, not to mention computing its steady state probability. As is common in domains such as probabilistic verification and performance analysis, we use a compositional model consisting of a network of sparsely-interacting probabilistic automata. A probabilistic automaton \(\mathcal{A}=(Q,\varSigma ,\delta )\) is an input-dependent Markov chain where every input letter \(\sigma \in \varSigma \) induces a different transition matrix over state-space Q. The probabilistic transition function is thus of the form \(\delta :Q\times \varSigma \times Q\rightarrow [0,1]\) satisfying

$$\sum _{q'\in Q} \delta (q,\sigma ,q')=1$$

for every q and \(\sigma \). A Markov chain can be viewed as a degenerate probabilistic automaton with no alphabet, whose transition function is of the form \(\delta :Q\times Q\rightarrow [0,1]\).

Let \(N=\{1,\ldots ,n\}\). A network of n interacting probabilistic automata is given as \(\mathcal{A}=(\mathcal{A}_1,\ldots ,\mathcal{A}_n,h)\) where \(\mathcal{A}_i=(Q_i,\varSigma _i,\delta _i)\) and \(h:N\rightarrow 2^N\) is an influence function such that h(i) is the set of the other automata (besides itself) whose states are observed by \(\mathcal{A}_i\) and influence its transitions. In our network each automaton has a state-space encoded by one bit, \(Q_i=\mathbb {B}\), and an input alphabet \(\varSigma _i=\mathbb {B}^{|h(i)|}\), which is the joint state-space of its influencing automata. The composition of the automata yields a global Markov chain \((Q,\delta )\) with \(Q=Q_1\times \cdots \times Q_n=\mathbb {B}^n\). The local input letter read by automaton \(\mathcal{A}_i\) in a global state q is the projection of q on the variables in h(i), which we denote by \(\pi _i(q)\). The transition function of the global Markov chain is defined as

$$\delta ((q_1,\ldots ,q_n),(q'_1,\ldots ,q'_n)) = \prod _{i=1}^{n}\delta _i(q_i,\pi _i(q),q'_i).$$

The structure of h(i) can be used to classify models according to variable interaction. When the maximum of |h(i)| is small, the system admits a small description from which random sequences for training and evaluation can be generated.
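To make the model concrete, here is a small generator sketch (our own encoding: delta[i] is assumed to map each observed letter to the probability of moving to state 1, for each current local state):

```python
import numpy as np

def generate(state0, h, delta, T, seed=0):
    """Generate T input vectors from a network of 2-state probabilistic
    automata.  h[i] lists the automata observed by automaton i;
    delta[i][letter][q_i] is P(q'_i = 1) given the local state q_i and the
    observed letter, i.e. the projection pi_i(q) as a tuple of bits."""
    rng = np.random.default_rng(seed)
    q, xs = np.array(state0, dtype=int), []
    for _ in range(T):
        nxt = q.copy()
        for i in range(len(q)):
            letter = tuple(int(b) for b in q[h[i]])           # pi_i(q)
            nxt[i] = int(rng.random() < delta[i][letter][q[i]])
        q = nxt                                               # synchronous update
        xs.append(q.copy())
    return np.stack(xs)
```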

For each class of models we draw model instances randomly and measure the reduction obtained by our algorithm with respect to inputs generated by the model. All model classes share a tuning parameter \(\alpha \in [0,1]\) intended to quantify the degree of regularity in the input sequences which can be exploited to come up with a good input pairing. Whenever we need to fix a probability while defining a model instance, we draw it from a set \(I_\alpha \) whose definition interpolates between the extreme cases described next.

The regularity in the inputs (and the potential effectiveness of our procedure) is monotonically decreasing in \(\alpha \). When \(\alpha =0\) the probabilities are taken from \(\{0,1\}\) and the resulting model is deterministic. When \(\alpha =1/2\) the probabilities are drawn from the whole interval [0, 1], and when \(\alpha =1\) all probabilities in the model instances are equal to 1/2. In this case there is no regularity in the input, all sequences of states and transitions are uniformly distributed, and no switching reduction is expected because any input pairing would be as good as any other.
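Since the exact definition of \(I_\alpha \) is not reproduced above, the following sketch uses one concrete family with precisely these extreme cases; this particular interpolation is an assumption of ours, not the paper's formula:

```python
import random

def draw_prob(alpha: float) -> float:
    """Draw a transition probability from a family I_alpha with the stated
    extremes: alpha = 0 -> {0, 1}; alpha = 1/2 -> [0, 1]; alpha = 1 -> {1/2}.
    NOTE: this interpolation is our assumption, not the paper's definition."""
    if alpha <= 0.5:
        u = random.random() * alpha                    # [0, alpha] ...
        return u if random.random() < 0.5 else 1 - u   # ... union [1 - alpha, 1]
    lo, hi = alpha - 0.5, 1.5 - alpha                  # interval shrinking to {1/2}
    return lo + random.random() * (hi - lo)
```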

The whole experimental protocol is summarized in Algorithm 2. For each model class and value of \(\alpha \), we randomly draw a set \(\{M_1,\ldots ,M_{50}\}\) of model instances. For each instance \(M_i\) we generate a training sequence \({\underline{x}}_i\) of length 10000, apply our algorithm and synthesize an optimized circuit \(C_i\). We generate an evaluation sequence \(x_i\) of length 10000 and let \({\underline{S}}_i\) be the number of switchings it induces in \(C_i\). Then we draw a set \(\{C_{i1},\ldots ,C_{i20}\}\) of arbitrary circuits, let \({S}_i\) be the average number of switchings induced by \(x_i\) in these circuits, and let \(R_i\) be the relative reduction of \({\underline{S}}_i\) with respect to \(S_i\). Finally, R is the average reduction over all model instances of the same class.

Algorithm 2. The experimental evaluation protocol.
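In code, one instance of the protocol looks roughly as follows (a sketch reusing synthesize_tree from Sect. 3; random_tree is a hypothetical helper that draws an arbitrary realization of the same topology):

```python
import numpy as np

def tree_switchings(tree, xs):
    """Total switchings induced by a sequence xs (shape (T, n)) in a
    level-structured tree as returned by synthesize_tree."""
    sigs, total = xs.copy(), 0
    for pairs in tree:
        out = np.stack([sigs[:, j] & sigs[:, k] for j, k in pairs], axis=1)
        total += int((out[1:] != out[:-1]).sum())   # switchings at this level
        sigs = out
    return total

def relative_reduction(train_xs, eval_xs, random_tree, n_rand=20):
    """R_i for one model instance: synthesize on the training sequence and
    compare against n_rand arbitrary realizations on the evaluation sequence."""
    s_opt = tree_switchings(synthesize_tree(train_xs), eval_xs)
    s_rand = np.mean([tree_switchings(random_tree(), eval_xs)
                      for _ in range(n_rand)])
    return 1.0 - s_opt / s_rand
```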

Independent Inputs. We start by evaluating the switching reduction for two simple cases where the input variables are independent of each other. In the first, the value of each input variable is drawn according to a stateless Bernoulli process with parameter \(a_i\); in the second, each bit is generated by an independent two-state Markov chain with parameters \(a_i\) and \(b_i\). The respective transition matrices are:

$$P_{\mathrm{Bernoulli}}=\begin{pmatrix} 1-a_i & a_i \\ 1-a_i & a_i \end{pmatrix} \qquad P_{\mathrm{Markov}}=\begin{pmatrix} 1-a_i & a_i \\ b_i & 1-b_i \end{pmatrix}$$

(rows indexed by the current bit value, columns by the next one).

For these models \(\mu _{jk}\) is computed analytically (see Table 2) without a training sequence. Figure 4-(a) shows for these two model classes the average reduction obtained by our algorithm as a function of \(\alpha \). In both cases the reduction is around 70 % when the system is close to deterministic and 30 % when probabilities are taken from [0, 1].

Table 2. The probabilities of transition pairs for two sequences generated by: (a) Bernoulli processes with parameters \(a_j\) and \(a_k\); (b) independent Markov chains with parameters \(a_j,b_j\) and \(a_k,b_k\).
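As an example of such an analytic computation (our derivation under the i.i.d. reading of the Bernoulli model): the gate output \(g=x_j\wedge x_k\) is itself an i.i.d. Bernoulli variable with parameter \(p=a_ja_k\), so its probability of switching between consecutive steps is

$$\mu _{jk}=\Pr [g[t-1]\ne g[t]]=2p(1-p)=2a_ja_k(1-a_ja_k).$$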

Cascades. Next we explore the class of cascade structures where the automata are ordered and each automaton observes the states of some of its predecessors. A network is a cascade of depth k if \(h(i)=\{i-k,\ldots ,i-1\}\), and the number of transition matrices for each automaton is \(2^k\). The results for cascades of depth 1 and 2 are plotted in Fig. 4-(b). For depth 1 the reduction ranges from 70 % for close-to-deterministic inputs to 15 % for \(\alpha =1/2\), while for depth 2 the range is from 50 % to 10 %.

Partitioned Variables. Next we applied our procedure to a network where the variables are partitioned into clusters of size 2 and 4 and each automaton observes only the states of the automata in its cluster. The results are plotted in Fig. 4-(c). For 2-clusters the range of reduction is between 65 % for almost deterministic inputs and 15 % for \(\alpha =0.5\), while for 4-clusters the corresponding reductions are less than 50 % and 10 %.

Arbitrary Sparse Network. In the last class of examples we consider arbitrary networks where each automaton observes the states of k randomly chosen other automata. Figure 4-(d) shows the results obtained for \(k=2\) and 4. In the former case we obtain 45 % for \(\alpha =0.05\) and around 5 % for \(\alpha =0.5\), while for the latter we obtain the worst results: less than 10 % for quasi-deterministic inputs and less than 5 % when probabilities are drawn anywhere in [0, 1].
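The interaction patterns of this section differ only in their influence maps h; a sketch (the helper name and encoding are ours):

```python
import random

def influence(n: int, kind: str, k: int) -> dict:
    """Influence maps h(i) for the model classes above (a sketch)."""
    if kind == "cascade":        # h(i) = {i - k, ..., i - 1}
        return {i: list(range(max(0, i - k), i)) for i in range(n)}
    if kind == "cluster":        # blocks of size k; observe the rest of the block
        return {i: [j for j in range((i // k) * k, min(n, (i // k) * k + k))
                    if j != i] for i in range(n)}
    if kind == "random":         # k randomly chosen other automata
        return {i: random.sample([j for j in range(n) if j != i], k)
                for i in range(n)}
    raise ValueError(kind)
```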

Table 3 shows the average number of absolute switching eliminations per gate in one time step. Upon closer inspection we observe that the results become consistently worse as the number of variables observed by an automaton becomes larger, quite independently of the interaction pattern. This may be an artifact of the way we generate model instances: when an automaton has several transition matrices, the values of an entry (u, v) in different matrices may be taken from opposite sides of \(I_\alpha \); these cancel each other and render the behavior of the variables more random and less regular.

Fig. 4. The average switching reduction as a function of the uniformity parameter \(\alpha \) for different input models: (a) independent inputs, Bernoulli (dashed red) and Markov processes; (b) variables arranged in a cascade structure of depth 1 (dashed red) and 2; (c) variables partitioned into mutually-dependent clusters of size 2 (dashed red) and 4; (d) each variable depends on 2 (dashed red) and 4 other arbitrary variables (Color figure online).

Table 3. The absolute reduction in the number of switchings per gate per time step for all the models.

5 Evaluation: A Mini Instruction Decoder

Finally, we synthesize a mini instruction decoder, where we apply our procedure to a full AIG. We consider a very simple hand-held calculator whose instructions are listed in Table 4. The instructions are encoded using 4 bits although 3 would suffice, reflecting the fact that in real applications often not all possible input combinations are used.

Table 4. The instruction set of the calculator.

We assume that the typical use of the calculator will be just to perform an operation (add, subtract, multiply, divide) on two numbers entered from the numeric keypad. More sophisticated users might perform more complex operations, say add three numbers at once, but with a lower probability. The Markov model for instruction sequences is depicted in Fig. 5 and explained below; a generator sketch in code follows the list:

  1. With probability \(p_{lm}\), load an argument previously stored in memory; otherwise just type in some number as the first argument.

  2. Press one of \(\{+,-,\times ,\div \}\) with respective probabilities \(\{p_{add},p_{sub},p_{mul},p_{div}\}\).

  3. Load the second argument either from memory (probability \(p_{lm}\)) or by typing the number.

  4. Evaluate by pressing ‘\(=\)’ and then, with probability \(p_{sm}\), store the result in memory.
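A generator following this usage model might look as follows (a sketch; the 4-bit opcodes are placeholders of ours, since Table 4's encoding is not reproduced here):

```python
import random

# Placeholder 4-bit opcodes; the actual encoding is given in Table 4.
OPS = {"num": 0, "load": 1, "store": 2, "eq": 3,
       "add": 4, "sub": 5, "mul": 6, "div": 7}

def instruction_stream(T, p_lm=0.1, p_sm=0.1,
                       ops=(("add", 0.4), ("sub", 0.3), ("mul", 0.2), ("div", 0.1))):
    """Generate a stream of T instructions following the four steps above."""
    out = []
    while len(out) < T:
        out.append(OPS["load"] if random.random() < p_lm else OPS["num"])  # step 1
        r, acc = random.random(), 0.0
        for name, p in ops:                                                # step 2
            acc += p
            if r < acc:
                out.append(OPS[name])
                break
        out.append(OPS["load"] if random.random() < p_lm else OPS["num"])  # step 3
        out.append(OPS["eq"])                                              # step 4
        if random.random() < p_sm:
            out.append(OPS["store"])
    return out[:T]
```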

Fig. 5. The probabilistic model of the instruction generator.

Fig. 6. A comparison of the number of switchings in the optimized instruction decoder against 20 other arbitrary realizations. The height of each bar shows how much switching is saved by using the optimized circuit compared to that realization.

For the experiment we set the parameters of the model as follows:

$$ \begin{array}{lll} p_{lm}= 0.1 & p_{add} = 0.4 & p_{sub} = 0.3 \\ p_{mul} = 0.2 & p_{div} = 0.1 & p_{sm} = 0.1 \end{array} $$

We generate from the model a training sequence of length 20000 and use it to synthesize an optimized circuit denoted by OC. For evaluation purposes we generate an input sequence of length 100000 and compare the number of switchings it induces in OC with that of 20 randomly drawn implementations of the decoder. The results are shown in Fig. 6. Note that there is a large variation in the number of switchings among the different realizations. Circuit OC was always better than any of the other circuits and on average achieved a reduction of 16.49 %. Naturally, these results are also sensitive to the uniformity of the probabilities. For example, when we set \(p_{lm}=0.25\) and \(p_{sm}=0.2\) we obtained a smaller reduction of 12.53 %.

6 Discussion

The interest in switching reduction, and in the evaluation of circuit behavior against probabilistic models in general [6], is not new. Concerning switching reduction, we can distinguish between an abstract approach like ours, which focuses only on the number of transitions as an approximate indicator of power consumption, and more physical approaches that map abstract circuits onto a concrete technology where power consumption can be measured more accurately. The work of [16], which belongs to the second category, mentions the abstract problem that we solve here as a suggestion for future work that could be plugged in upstream of their own work on power-aware mapping using a real technology library. The work of [19] is also of this type, mapping abstract AIGs to real gates. The input is specified as a set of input vectors (patterns), and simulation with these patterns is used to estimate power consumption for different mapping alternatives onto real gates from a library.

The work of [17, 18] applies similar reasoning concerning input pairing for 2and gates and uses a variant of Huffman’s algorithm for constructing a binary tree with minimal average weighted path length [9]. However, this work is restricted to the case where variables are assumed to be generated by independent Bernoulli processes, while our approach is applicable to any small-description Markov process or any user-provided training sequence. Moreover, they use a greedy pairing algorithm: at each step, the single pair of variables that induces the least expected number of switchings is selected as input to an and gate. Experiments show that our scheme, which treats a complete level of the tree at once via optimal matching, is significantly more efficient.

The work of [11] also uses Huffman’s algorithm but in a different way that seems to yield a random balanced tree. They do not give any explicit probabilistic model but introduce some delay assumptions and claim their algorithm to be optimal in reducing only the switching activity due to glitches. This is the place to mention that since we do not model gate delays, we cannot detect glitches, but one may argue that their importance in balanced tree structures is less pronounced. The work of [18] is extended significantly in [20], which gives an optimal algorithm for unbounded-depth 2and synthesis, restricted to a Bernoulli input model. Their algorithm tends to produce deep circuits with long delays.

To summarize, we devised a novel procedure for an early step in the synthesis flow for digital circuits/functions. The major novelty of the algorithm is its ability to approximate in a tractable manner, polynomial in the number of inputs to an and gate, the minimal average-case number of switchings, based on a training input sequence. The approach can be applied, in principle, to any probabilistic model of the input but, of course, formal guarantees of approximation quality can be given only in restricted cases.

For synthetic empirical evaluation we developed an original framework based on sparsely-interacting networks of probabilistic automata and ran extensive experiments under various probabilistic models of the input. The reductions obtained on these synthetic examples were quite impressive, reaching tens of percent in some cases. Two major open questions remain concerning their transfer to real life:

  1. Can such reductions be pushed downstream to the more physical steps of synthesis? This question has two versions: can it be done using existing commercial tools that carry a lot of legacy, and can it be done in principle by new tools, if this type of optimization criterion is considered important?

  2. What do real applications look like in terms of circuit structure and input model?

We made a preliminary exploration of the second question using the instruction decoder model, and the results seem encouraging. We believe the behavior of real circuits is much more regular than arbitrary Markov chains. In the future we intend to attack larger industrial-scale examples and follow them, as far as possible, down to technology-dependent mapping, where timing effects can be detected and real power consumption measured. It has already been observed that synthesis is an old technology whose commercial tools are sensitive to many syntactic features of their input [13], and we hope that this work will bring a fresh look at the topic.

On the theoretical side, we intend to see under what assumptions our level-greedy algorithm is optimal and to give bounds on its deviation from the optimum when it is not. Another potential direction for exploration is to expose trade-offs between speed and switching reduction by being less committed to the circuit topology. Although the number of inputs to a single and cone typically need not be very large, it would be interesting to explore how far we can go with the number of inputs using the polynomial algorithm for optimal matching. Finally, we intend to extend this work to sequential machines and to explore the application of switching-oriented reasoning to the encoding of states and symbolic inputs.