
Modern Electronic Design Automation tools have the capacity to synthesize the control unit from a finite state machine description, or even to extract and synthesize the control unit from a functional description of the complete circuit (Chap. 5). Nevertheless, in some cases the digital circuit designer may be interested in performing part of the control unit synthesis by himself. Two specific synthesis techniques are presented in this chapter: command encoding and hierarchical decomposition [1]. Both pursue a double objective: on the one hand, they aim at reducing the circuit cost; on the other hand, they can make the circuit easier to understand and to debug. The latter is probably the most important aspect.

The use of components whose latency is data-dependent has been implicitly dealt with in Sect. 2.5. Some additional comments about variable-latency operations are made in the last section of this chapter.

4.1 Command Encoding

Consider the control unit of Fig. 2.6 and assume that commands is an m-bit vector, conditions a p-bit vector and internal_state an n-bit vector. Thus, the command generation block generates m + 1 binary functions of p + n binary variables. Nevertheless, the number s of different commands is generally much smaller than 2^m. An alternative option is to encode the s commands with a t-bit vector, with 2^t ≥ s. The command generation block of Fig. 2.6 can then be decomposed into two blocks as shown in Fig. 4.1: the first one generates t + 1 binary functions of p + n variables, and the second one (the command decoder) m binary functions of t binary variables.

Fig. 4.1 Command encoding
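As an illustration, the decomposition of Fig. 4.1 maps naturally onto two blocks: one that produces the next state and an encoded command from the internal state and the conditions, and a purely combinational command decoder. The following VHDL sketch only shows this structure; all names, widths and command codes are illustrative assumptions, not those of any particular example.

library ieee;
use ieee.std_logic_1164.all;

-- Minimal sketch of the structure of Fig. 4.1 (names, widths and codes
-- are illustrative assumptions).
entity encoded_control is
  port (
    clk, reset : in  std_logic;
    conditions : in  std_logic_vector(1 downto 0);  -- p condition bits
    commands   : out std_logic_vector(7 downto 0)   -- m command bits
  );
end entity encoded_control;

architecture sketch of encoded_control is
  type state_type is (s0, s1, s2);
  signal state       : state_type;                     -- internal_state (n bits)
  signal encoded_cmd : std_logic_vector(2 downto 0);   -- t-bit encoded command
begin
  -- first block: t + 1 functions of p + n variables (next state and encoded command)
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= s0;
      else
        case state is
          when s0 => if conditions(0) = '1' then state <= s1; end if;
          when s1 => state <= s2;
          when s2 => state <= s0;
        end case;
      end if;
    end if;
  end process;

  with state select encoded_cmd <=
    "000" when s0,      -- nop
    "001" when s1,      -- first command
    "010" when s2;      -- second command

  -- second block: command decoder, m functions of t variables
  process (encoded_cmd)
  begin
    commands <= (others => '0');
    case encoded_cmd is
      when "001"  => commands(0) <= '1';            -- e.g. a register load
      when "010"  => commands(7 downto 6) <= "10";  -- e.g. a mux selection
      when others => null;                          -- "000": nop
    end case;
  end process;
end architecture sketch;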

A generic circuit-complexity measure is the number of bits that a memory (ROM) must store in order to implement the same functions. Thus, the complexity of a circuit implementing m + 1 functions of p + n variables is

$$ \left( {m + 1} \right)\cdot 2^{p + n} {\text{bits}}, $$
(4.1)

and the total complexity of two circuits implementing t + 1 functions of p + n variables and m functions of t variables, respectively, is

$$ \left( {t + 1} \right) \cdot 2^{p + n} + m \cdot 2^{t} {\text{bits}}. $$
(4.2)

Obviously, this complexity measure only takes into account the numbers of outputs and inputs of the combinational blocks, and not the functions they actually implement.

Another generic complexity measure is the minimum number of LUTs (Chap. 1) necessary to implement the functions, assuming that no LUT is shared by two or more functions. If k-input LUTs are used, the minimum number of LUTs for implementing a function of r variables is

$$ \left\lceil {\left( {r - 1} \right)/\left( {k - 1} \right)} \right\rceil {\text{LUTs}}, $$

and the minimum delay of the circuit is

$$ \left\lceil {log_{k} r} \right\rceil \cdot T_{LUT} $$

where T_LUT is the delay of a k-input LUT.
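For instance, a single function of r = 7 variables mapped onto 4-input LUTs (k = 4) needs at least

$$ \left\lceil {\left( {7 - 1} \right)/\left( {4 - 1} \right)} \right\rceil = 2\;{\text{LUTs}}\quad {\text{and}}\quad \left\lceil {log_{4} 7} \right\rceil \cdot T_{LUT} = 2\,T_{LUT} , $$

since a single 4-input LUT accommodates only four of the variables, and two LUTs, one feeding the other, accommodate at most 4 + 3 = 7 distinct inputs.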

The complexities corresponding to the two previously described options are

$$ \left( {m + 1} \right) \cdot \left\lceil {\left( {p + n - 1} \right)/\left( {k - 1} \right)} \right\rceil {\text{LUTs}} $$
(4.3)

and

$$ \left( {t + 1} \right) \cdot \left\lceil {\left( {p + n - 1} \right)/\left( {k - 1} \right)} \right\rceil + m \cdot \left\lceil {\left( {t - 1} \right)/\left( {k - 1} \right)} \right\rceil {\text{LUTs}}, $$
(4.4)

and the delays

$$ \left\lceil {log_{k} \left( {p + n} \right)} \right\rceil \cdot T_{LUT} \;{\text{and}}\;\left( {\left\lceil {log_{k} \left( {p + n} \right)} \right\rceil + \left\lceil {log_{k} t} \right\rceil } \right) \cdot T_{LUT} . $$
(4.5)

Example 4.1

Consider the circuit of Sect. 2.5 (scalar_product.vhd, available at the Authors’ web page). The commands consist of 26 bits: eight one-bit signals and nine two-bit signals. There are four binary conditions, and the finite-state machine has 40 states. Thus, m = 26, p = 4 and n = 6. Nevertheless, there are only 31 ≪ 2^26 different commands, which can be encoded with t = 5 bits.

Thus, the complexities in numbers of stored bits (4.1 and 4.2) to be compared are

$$ \left( {m + 1} \right) \cdot 2^{p + n} = 27 \cdot 2^{10} = 27,648\;{\text{bits}}, $$
(4.6)
$$ \left( {t + 1} \right) \cdot 2^{p + n} + m \cdot 2^{t} = 6 \cdot 2^{10} + 26 \cdot 2^{5} = 6,976\;{\text{bits}}, $$
(4.7)

and the complexities in numbers of LUTs (4.3 and 4.4), assuming that 4-input LUTs are used, are

$$ \left( {m + 1} \right) \cdot \left\lceil {\left( {p + n - 1} \right)/3} \right\rceil = 27 \cdot \left\lceil {9/3} \right\rceil = 81\;{\text{LUTs}}, $$
(4.8)
$$ \left( {t + 1} \right) \cdot \left\lceil {\left( {p + n - 1} \right)/3} \right\rceil + m \cdot \left\lceil {\left( {t - 1} \right)/3} \right\rceil = 6 \cdot \left\lceil {9/3} \right\rceil + 26 \cdot \left\lceil {4/3} \right\rceil = 70\;{\text{LUTs}}. $$
(4.9)

The corresponding minimum delays (4.5) are

$$ \left\lceil {log_{k} \left( {p + n} \right)} \right\rceil \cdot T_{LUT} = \left\lceil {log_{4} 10} \right\rceil \cdot T_{LUT} = 2\,T_{LUT} , $$
(4.10)
$$ \left( {\left\lceil {log_{k} \left( {p + n} \right)} \right\rceil + \left\lceil {log_{k} t} \right\rceil } \right) \cdot T_{LUT} = \left( {\left\lceil {log_{4} 10} \right\rceil + \left\lceil {log_{4} 5} \right\rceil } \right) \cdot T_{LUT} = 4\,T_{LUT} . $$
(4.11)

The second complexity measure (number of LUTs) is surely more accurate than the first one. Thus, according to (4.8)–(4.11), the encoding of the commands hardly reduces the cost and increases the delay. So, in this particular case, the main advantages are clarity, flexibility and ease of debugging, rather than cost reduction.

4.2 Hierarchical Control Unit

Complex circuits are generally designed in a hierarchical way. As an example, the data path of the scalar product circuit of Sect. 2.5 (Fig. 2.18) includes a polynomial adder (XOR gates), a classic squarer and an interleaved multiplier, and the latter in turn consists of a data path and a control unit (Fig. 4.2). This is a common strategy in many fields of system engineering: hierarchy improves clarity, security, ease of debugging and maintenance, thus reducing development times.

Fig. 4.2 Hierarchical circuit

Nevertheless, this type of hierarchy, based on the use of previously defined components, does not allow computation resources to be shared between several components. As an example, one of the components of the circuit of Sect. 2.5 is a polynomial adder, and the interleaved multiplier also includes a polynomial adder. A slight modification of the operation scheduling, avoiding the execution of field multiplications and additions at the same time, would make it possible to use the same polynomial adder for both operations. Then, instead of the architecture of Fig. 4.2, a conventional (flat) structure with a data path including only one polynomial adder could be considered. In order to maintain some type of hierarchy, the corresponding control unit could be divided up into a main control unit, in charge of controlling the execution of the main algorithm (scalar product), and a secondary control unit, in charge of controlling the execution of the interleaved multiplication.

Consider another, simpler, example.

Example 4.2

Design a circuit that computes

$$ z = \sqrt {x^{2} + y^{2} } . $$

The following algorithm computes z:
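In outline (the names of the intermediate variables are assumptions), it amounts to

$$ a = x^{2} ,\quad b = y^{2} ,\quad c = a + b,\quad z = \sqrt c . $$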

A first solution is to use three components: a squaring circuit, an adder and a square rooting circuit, for example that of Sect. 2.1. The corresponding circuit would include two adders, one for computing c, and the other within the square_root component (Fig. 2.3). Another option is to substitute, in the preceding algorithm, the call to square_root with the corresponding sequence of operations. After scheduling the operations and assigning registers to variables, the following algorithm is obtained:

This algorithm can be executed by the data path of Fig. 4.3.

Fig. 4.3 Data path

In order to distinguish between the main algorithm and the square root computation, the control unit can be divided up as shown in Fig. 4.4. A command decoder (Sect. 4.1) is used; there are eight different commands, which are encoded with three bits.

Fig. 4.4 Hierarchical control unit

The following process describes the command decoder:
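A minimal sketch of such a decoder is given below. The signal encoded_cmd and the decoded control signals (load_a, load_b, load_c, sel) are hypothetical names standing for the actual controls of the data path of Fig. 4.3; only the nop encoding ("000", see below) is taken from the text, and the remaining command codes are assumptions.

-- hypothetical command decoder sketch; only some of the eight commands are shown
process (encoded_cmd)
begin
  -- default: nop, no register is loaded
  load_a <= '0'; load_b <= '0'; load_c <= '0';
  sel    <= "00";
  case encoded_cmd is
    when "001"  => load_a <= '1';                -- e.g. a <= x*x (operation 1)
    when "010"  => load_b <= '1';                -- e.g. b <= y*y (operation 2)
    when "011"  => load_c <= '1'; sel <= "01";   -- e.g. c <= a + b (operation 3)
    when "100"  => load_c <= '1'; sel <= "10";   -- e.g. a square root step
    when others => null;                         -- "000": nop; other codes omitted
  end case;
end process;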

The two control units communicate through the start_root and root_done signals. The first control unit has six states corresponding to a “wait for start” loop, four steps of the main algorithm (operations 1, 2, 3, and the set of operations 4–8), and an “end of computation” detection. It can be described by the following process:
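A minimal sketch of this first control unit is shown below. It assumes that state_main is of an enumerated type (wait_start, op1, op2, op3, root, ending) and that command_main carries the encoded command; these names and the command codes other than nop are illustrative, while start, start_root and root_done are the signals mentioned above.

-- sketch of the main control unit (six states); names and codes are assumptions
process (clk)
begin
  if rising_edge(clk) then
    if reset = '1' then
      state_main <= wait_start;
    else
      case state_main is
        when wait_start =>                       -- "wait for start" loop
          if start = '1' then state_main <= op1; end if;
        when op1    => state_main <= op2;        -- operation 1
        when op2    => state_main <= op3;        -- operation 2
        when op3    => state_main <= root;       -- operation 3, launch the root
        when root   =>                           -- operations 4-8 (square root)
          if root_done = '1' then state_main <= ending; end if;
        when ending =>                           -- "end of computation" detection
          state_main <= wait_start;
      end case;
    end if;
  end if;
end process;

start_root   <= '1' when state_main = op3 else '0';
command_main <= "001" when state_main = op1 else   -- codes are assumptions,
                "010" when state_main = op2 else   -- except nop = "000"
                "011" when state_main = op3 else
                "000";                             -- nop while the root unit works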

The second control unit has five states corresponding to operations 4, 5, 6, and 7, and “end of root computation” detection:
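A corresponding sketch of the second control unit follows; the state names, the command codes and the last_iteration condition (whatever marks the last square root step in the data path) are assumptions.

-- sketch of the square root control unit (five states); names and codes are assumptions
process (clk)
begin
  if rising_edge(clk) then
    if reset = '1' then
      state_root <= op4;
      root_done  <= '1';
    else
      case state_root is
        when op4 =>                              -- operation 4, wait for start_root
          if start_root = '1' then
            root_done  <= '0';
            state_root <= op5;
          end if;
        when op5 => state_root <= op6;           -- operation 5
        when op6 => state_root <= op7;           -- operation 6
        when op7 =>                              -- operation 7
          if last_iteration = '1' then state_root <= root_end;
          else state_root <= op5;                -- iterate the root loop
          end if;
        when root_end =>                         -- "end of root computation"
          root_done  <= '1';
          state_root <= op4;
      end case;
    end if;
  end if;
end process;

command_root <= "100" when state_root = op5 else   -- codes are assumptions
                "101" when state_root = op6 else
                "110" when state_root = op7 else
                "000";                             -- nop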

The code corresponding to nop is 000, so that the actual command can be generated by ORing the commands generated by both control units:
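With the signal names of the sketches above, this reduces to a single concurrent assignment:

-- nop = "000", so the idle control unit never overrides the active one
encoded_cmd <= command_main or command_root;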

A complete VHDL model example4_1.vhd is available at the Authors’ web page.

Comments 4.1

  • This technique is similar to the use of procedures and functions in software development.

  • In the preceding example, dividing up the control unit was not necessary; it was done only for didactic purposes. As in the case of software development, this method is useful when there are several calls to the same procedure or function.

  • This type of approach to control unit synthesis is more a question of clarity (a well-structured control unit) and ease of debugging and maintenance than of cost reduction (control units are not expensive).

4.3 Variable-Latency Operations

In Sect. 2.3, operation scheduling was performed assuming that the computation times t_JM of all operations were constant values. Nevertheless, in some cases the computation time is not a constant but a data-dependent value. As an example, the latency t_m of the field multiplier interleaved_mult.vhd of Sect. 2.5 depends on the particular operand values. In this case, the scheduling of the operations was done using an upper bound of t_m, so an implementation based on this schedule should include "wait for t_m cycles" loops. Nevertheless, the proposed implementations (scalar_product.vhd and scalar_product_DF2.vhd) are slightly different: they use the mult_done flag generated by the multiplier. For example, scalar_product_DF2.vhd (Sect. 2.5) contains several statements of the following form:
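The general shape of those statements is a wait on mult_done, as in the following fragment; the state names wait_mult and next_operation and the signal current_state are placeholders, not the literal code of scalar_product_DF2.vhd.

-- stay in the same state until the interleaved multiplier raises mult_done
when wait_mult =>
  if mult_done = '1' then
    current_state <= next_operation;
  end if;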

In an implementation that strictly respects the schedule of Fig. 2.14, these statements should be replaced by constructions equivalent to a "wait for t_m cycles" loop.
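Such a wait can be implemented, for example, with a cycle counter; in the following sketch cycle_count is a hypothetical counter and delta stands for the upper bound of t_m.

-- wait for a fixed number (delta) of cycles, whatever the actual multiplication time
when wait_mult =>
  if cycle_count = delta - 1 then
    cycle_count   <= 0;
    current_state <= next_operation;
  else
    cycle_count <= cycle_count + 1;
  end if;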

In fact, the pipelined circuit of Fig. 3.6 (pipeline_DF2.vhd) has been designed using such an upper bound of t_m: a generic parameter delta was defined, and a time_out signal is generated by the control unit every delta cycles. On the other hand, the self-timed version of this same circuit (Example 3.4) uses the mult_done flags generated by the multipliers.

Thus, in the case of variable-latency components, two options can be considered: the first is to compute beforehand an upper bound of their computation times, if such a bound exists; the second is to use a start-done protocol, in which done is lowered on the positive edge of start and raised when the results are available. The second option is more general and yields circuits whose average latency is shorter. Nevertheless, in some cases, for example for pipelining purposes, the first option is better.

Comment 4.2

A typical case of data-dependent computation time corresponds to algorithms that include while loops: some iteration is executed as long as some condition holds true. Nevertheless, for unrolling purposes, the algorithm should be modified and the while loop replaced by a for loop with a fixed number of steps, such as for i in 0 to n − 1 loop. Thus, in some cases it may be worthwhile to replace a slow variable-latency component with a fast constant-latency one.
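As an illustration (an assumed example, not one of the book's models), a data-dependent normalization step written with a while loop can be rewritten with a fixed iteration count, so that it can be unrolled; v is a variable and n a constant declared elsewhere.

-- variable latency: iterate while the leading bit is '0'
-- while v(n-1) = '0' loop
--   v := v(n-2 downto 0) & '0';
-- end loop;

-- constant latency: always execute n conditional steps (unrollable)
for i in 0 to n - 1 loop
  if v(n - 1) = '0' then
    v := v(n - 2 downto 0) & '0';
  end if;
end loop;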

An example of a circuit including variable-latency components is presented.

Example 4.3

Consider again Algorithm 2.3, with the schedule of Fig. 2.17, so that two finite field multipliers are necessary. Assume that they generate output flags done_1 and done_2 when they complete their respective operations. The part of the algorithm corresponding to k_mi = 0 can be executed as follows:

The following VHDL model describes the circuit:
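Its key point is the synchronization on both done flags; a fragment of the control process could have the following shape (state and signal names are assumptions, not the actual code of unbounded_DF.vhd).

-- launch both field multipliers, then wait until both have finished,
-- whatever their respective (data-dependent) latencies
when start_mults =>
  start1        <= '1';
  start2        <= '1';
  current_state <= wait_mults;
when wait_mults =>
  start1 <= '0';
  start2 <= '0';
  if done1 = '1' and done2 = '1' then
    current_state <= next_operation;
  end if;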

A complete model unbounded_DF.vhd is available at the Authors’ web page. Other implementations, using latency upper bounds and/or pipelining or self-timing, are proposed as exercises.

4.4 Exercises

  1. Design a circuit that computes z = (x_1 − x_2)^(1/2) + (y_1 − y_2)^(1/2) with a hierarchical control unit (separate square rooter control units, see Example 4.2).

  2. Design a 2-step self-timed circuit that computes z = (x_1 − x_2)^(1/4) using two square rooters controlled by a start/done protocol.

  3. Design a 2-step pipelined circuit that computes z = (x_1 − x_2)^(1/4) using two square rooters, with a start input, whose maximum latencies are known.

  4. Consider several implementations of the scalar product circuit of Sect. 2.5, taking into account Comment 2.2. The following options could be considered:

    • hierarchical control unit;

    • with an upper bound of the multiplier latency;

    • pipelined version;

    • self-timed version.