Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

In a number of computer arithmetic applications, decimal systems are preferred to the binary ones. The reasons come, not only from the complexity of coding/decoding interfaces but, mostly from the lack of precision in the results of the binary systems.

For the following circuits the input and output operators are supposed to be encoded in Binary Encoded Digits (BCD).

11.1 Addition

Addition is a primitive operation for most arithmetic functions, and then it deserves special attention. The general principles for addition are in Chap. 7, and in this section we examine special consideration for efficient implementation of decimal addition targeting FPGAs.

11.1.1 Decimal Ripple-Carry Adders

Consider the base-B representations of two n-digit numbers

$$ \begin{gathered} x = x_{n - 1} \cdot B^{n - 1} + x_{n - 2} \cdot B^{n - 2} + \cdots + \, x_{0} \cdot B^{0} , \hfill \\ y = y_{n - 1} \cdot B^{n - 1} + y_{n - 2} \cdot B^{n - 2} + \cdots + y_{0} \cdot B^{0} . \hfill \\ \end{gathered} $$

The following (pencil and paper) Algorithm 11.1 computes the (n + 1)-digit representation of the sum z = x + y + c in where c in is the initial carry.

Algorithm11.1: Classic addition (ripple carry)

For B = 10, the classic ripple-carry for a BCD decimal adder cell can be implemented as suggested in Fig. 11.1 The mod-10 addition is performed adding 6 to the binary sum of the digits, when a carry for the next digit is generated. The VHDL model ripple_carry_adder_BCD.vhd is available at the Authors’ web page.

Fig. 11.1
figure 1

Decimal ripple-carry adder cell

As described in Chap. 7, the naïve implementation of an adder (ripple-carry, Fig. 7.1) has a meaningful critical path. In order to reduce the execution time of each iteration step, Algorithm 11.1 can be modified as shown in the next section.

11.1.2 Base-B Carry-Chain Adders

In order to improve the ripple-carry adder, a better solution is the use of two binary functions of two B-valued variables, namely the propagate (P) and generate (G) functions .

$$ \begin{gathered} p_{i} = p\left( {x_{i} ,y_{i} } \right) = 1\,{\text{if}}\,x_{i} + y_{i} = B - 1,\,p\left( {x_{i} ,y_{i} } \right) = 0\,{\text{otherwise}}; \hfill \\ g_{i} = g\left( {x_{i} ,y_{i} } \right) = 1\,{\text{if}}\,x_{i} + y_{i} \ge B,\,g\left( {x_{i} ,y_{i} } \right) = 0\,{\text{if}}\,x_{i} + y_{i} \le B - 2,\,{\text{otherwise}},\,{\text{any}}\,{\text{value}}. \hfill \\ \end{gathered} $$

So, c i + 1 can be expressed under the following way:

$$ c_{i + 1} = p_{i} \cdot c_{i} + {\text{not}}\,\left( {p_{i} } \right) \cdot g_{i} . $$

The corresponding modified Algorithm 11.2 is the following one.

Algorithm 11.2: Carry-chain addition

The use of propagate and generate functions allow generating a n-digit adder carry-chain array of Fig. 11.1. It is based on the Algorithm 11.2. The Generate-Propagate (G-P) cell calculates the Generate and Propagate functions; and the carry-chain (Cy.Ch) cell computes the next carry. Observe that the carry-chain cells are binary circuits, whereas the generate-propagate and the mod B sum cells are B-ary ones. As regards the computation time, the critical path is shaded in Fig. 11.2. (It has been assumed that T sum  > T Cy. Ch)

Fig. 11.2
figure 2

n-digits carry-chain adder

11.1.3 Base-10 Carry-Chain Adders

If B = 10, the carry-chain circuit remains unchanged but the P and G functions as well as the modulo-10 sums are somewhat more complex. In base 2 (Chap. 7), the P and G cells are respectively synthesized by XOR and AND functions, while in base 10, P and G are now defined as follows:

$$ p_{i} = 1\,{\text{if}}\,x_{i} + y_{i} = 9,\,p_{i} = 0\,{\text{otherwise}}; $$
(11.1)
$$ g_{i} = 1\,{\text{if}}\,x_{i} + y_{i} > 9,\,g_{i} = 0\,{\text{otherwise}}; $$
(11.2)

A straightforward way to synthesize P and G is shown at Fig. 11.3a. That is add the BCD numbers and detects if the sum is equal to nine and greater than nine respectively. Nevertheless, functions P and G may be directly computed from x i , y i inputs. The following formulas (11.3) and (11.4) are Boolean expressions of conditions (11.1) and (11.2).

$$ p_{i} = P_{0} \cdot [K_{ 1} \cdot (P_{ 3} \cdot K_{ 2} \vee K_{ 3} \cdot G_{ 2} ) \vee G_{ 1} \cdot K_{ 3} \cdot P_{ 2} ] $$
(11.3)
$$ g_{i} = G_{0} \cdot [P_{ 3} \vee G_{ 2} \vee P_{ 2} \cdot G_{ 1} ] \vee G_{ 3} \vee P_{ 3} \cdot P_{ 2} \vee P_{ 3} \cdot P_{ 1} \vee G_{ 2} \cdot P_{ 1} \vee G_{ 2} \cdot G_{ 1} $$
(11.4)

where \( P_{j} = x_{j} \oplus y_{j,} \,G_{j} = x_{j} \cdot y_{j} \) and \( K_{j} = x_{j}^{'} \cdot y_{j}^{'} \) are the binary propagator, generator and carry-kill for the jth components of the BCD digits x(i) and y(i).

Fig. 11.3
figure 3

a Simple G-P cell for BCD adder. b Carry-chain BCD adder ith digit computation

The BCD carry-chain adder ith digit computation is shown at Fig. 11.3b and it is made of a first binary mod 16 adder stage, a carry-chain cell driven by the G-P functions, and an output adder stage performing a correction (adding 6) whenever the carry-out is one. Actually, a zero carry-out c(i + 1) identifies that the mod 16 sum does not exceed 9, so no corrections are needed. Otherwise, the add-6 correction applies. Naturally, the G-P functions may be computed according to Fig. 11.3, using the outputs of the mod 16 stage, including the carry-out s 4.

The VHDL model cych_adder_BCD_v1.vhd that implements a behavioral model of the decimal carry-chain adder is available at the Authors’ web page.

11.1.4 FPGA Implementation of the Base-10 Carry-Chain Adders

The FPGA vendors provides dedicated resources to implement binary carry-chain adders [1, 2]. As mentioned in Chap. 7 a simple HDL description of a binary adder is implemented efficiently using the carry-logic. Nevertheless in order to use this resources in other designs it is necessary to instantiate the components manually.

Figure 11.4 shows a Xilinx implementation of the decimal carry-chain adder. The VHDL model cych_adder_BCD_v2.vhd that implements a low level component instantiation model of the decimal carry-chain adder is available at the Authors’ web page.

Fig. 11.4
figure 4

FPGA implementation of an N-digit BCD Adder

Observe that the critical path includes a 4-bit adder, the G-P computation; the n-digits carry propagation and a final 3-bit correction adder. Xilinx 6-input/2-output LUT is built as two 5-input functions while the sixth input controls a 2-1 multiplexor allowing to implement either two 5-input functions or a single 6-input one; so G and P functions fit in a single LUT as shown at Fig. 11.5a.

Other alternatives to implement in FPGA the decimal carry-chain adders, include computing the functions P and G directly from the decimal digits (x(i), y(i) inputs) using the Boolean expressions, instead of the intermediate sum bits s k [3].

For this purpose one could use formulas (11.3) and (11.4), nevertheless, in order to minimize time and hardware consumption the implementation of p(i) and g(i) is revisited as follows. Remembering that p(i) = 1 whenever the arithmetic sum x(i) + y(i) = 9, one defines a 6-input function pp(i) set to be 1 whenever the arithmetic sum of the first 3 bits of x(i) and y(i) is 4. Then p(i) may be computed as:

$$ p(i) = (x_{0} (i) \oplus y_{0} (i)) \cdot pp(i). $$
(11.5)

On the other hand, gg(i) is defined as a 6-input function set to be 1 whenever the arithmetic sum of the first 3 bits of x(i) and y(i) is 5 or more. So, remembering that g(i) = 1, whenever the arithmetic sum x(i) + y(i) > 9, g(i), may be computed as:

$$ g(i) = gg(i) \vee (pp(i) \cdot x_{0} (i) \cdot y_{0} (i)). $$
(11.6)

As Xilinx LUTs may compute 6-variable functions, then gg(i) and pp(i) may be synthesized using 2 LUTs in parallel while g(i) and p(i) are computed through an additional single LUT as shown at Fig. 11.5b.

11.2 Base-10 Complement and Addition: Subtration

11.2.1 Ten’s Complement Numeration System

B’s complement representation general principles are available in the literature. One restricts to 10’s complement system to cope with the needs of this section. A one-to-one function R(x), associating a natural number to x is defined as follows.

Every integer x belonging to the range –10n/2 ≤ x < 10n/2, is represented by R(x) = x mod 10n, so that the integer represented in the form \( x_{n - 1} x_{n - 2} \cdots x_{1} x_{0} \) is

$$\begin{aligned}x_{n - 1} \cdot 10^{n - 1} &+ x_{n - 2} \cdot 10^{n - 2} + \cdots + x_{0} \;{\text{if}}\,x_{n - 1} \cdot 10^{n - 1} + x_{n - 2} \cdot 10^{n - 2} + \cdots\\ &+ x_{0} < 10^{n} / 2, \end{aligned}$$
(11.7)
$$ \begin{aligned}x_{n - 1} \cdot 10^{n - 1} &+ x_{n - 2} \cdot 10^{n - 2} + \cdots + x_{0} - 10^{n} \;{\text{if}}\,x_{n - 1} \cdot 10^{n - 1} + x_{n - 2} \cdot 10^{n - 2} + \cdots\\& + x_{0} \ge 10^{n} / 2. \end{aligned}$$
(11.8)

The conditions (11.7) and (11.8) may be more simply expressed as

$$ x_{n - 1} \cdot 10^{n - 1} + x_{n - 2} \cdot 10^{n - 2} + \cdots + x_{0} \;{\text{if}}\,x_{n - 1} < 5, $$
(11.9)
$$ x_{n - 1} \cdot 10^{n - 1} + x_{n - 2} \cdot 10^{n - 2} + \cdots + x_{0} - 10^{n} \;{\text{if}}\,x_{n - 1} \ge 5. $$
(11.10)

Another way to express a 10’s complement number is:

$$ x_{n - 1}^{'} \cdot 10^{n - 1} + x_{n - 2} \cdot 10^{n - 2} + \cdots + x_{0} $$
(11.11)

where \( x_{n - 1}^{'} = x_{n - 1} -- 10\,{\text{if}}\,x_{n - 1} \ge 5 \) and \( x_{n - 1}^{'} = x_{n - 1} \,{\text{if}}\,x_{n - 1} < 5, \) while the sign definition rule is the following one: if x is negative then x n – 1 ≥ 5; otherwise x n − 1 < 5.

11.2.2 Ten’s Complement Sign Change

Given an n-digit 10’s complement integer x, the inverse z = −x of x, is an n-digit 10’s complement integer. Actually the only case −x cannot be represented with n digits is when \( x = - 10^{n} / 2,{\text{ so}} - x = 10^{n} / 2. \) The computation of the representation of −x is based on the following property. Assuming x to be represented as an n-digit 10’s complement number R(x), −x may be readily computed as

$$ - x = 10^{n + 1} - R\left( x \right). $$
(11.12)

A straightforward inversion algorithm then consists of representing x with n + 1 digits, complementing every digit to 9, then adding 1. Observe that sign extension is obtained by adding a digit 0 to the left of a positive number, or 9 for a negative number, respectively.

11.2.3 10’s Complement BCD Carry-Chain Adder-Subtractor

To compute X + Y similar algorithm as in Algorithm 11.2 (Sect. 11.1.2) can be used. In order to compute X − Y, 10’s complement subtraction algorithm actually adds (−Y) to X.

10’s complement sign change algorithm may be implemented through a digitwise 9’s complement stage followed by an add-1 operation. It can be shown that the 9’s complement binary components \( w_{ 3} ,w_{ 2} ,w_{ 1} ,w_{0} \) of a given BCD digit \( y_{ 3} ,y_{ 2} ,y_{ 1} ,y_{0} \) are expressed as

$$ w_{ 3} = y^{\prime}_{3} \cdot y^{\prime}_{2} \cdot y^{\prime}_{1} ;\quad w_{ 2} = y_{ 2} \oplus y_{ 1} ;\quad w_{ 1} = y_{ 1} ; \quad w_{0} = y^{\prime}_{0} $$
(11.13)

An improvement to the adder stage could be carried out by avoiding the delay produced by the 9’s complement step. Thus, this operation may be carried out within the first binary adder stage, where p(i) and g(i) are computed as

$$ \begin{gathered} p_{0} \left( i \right) = x_{0} \left( i \right) \oplus y_{0} \left( i \right) \oplus \left( {A^{'} /S} \right);\,p_{ 1} \left( i \right) = x_{ 1} \left( i \right) \oplus y_{ 1} \left( i \right);\,p_{ 2} \left( i \right) = x_{ 2} \left( i \right) \oplus y_{ 2} \left( i \right) \oplus y_{ 1} \left( i \right).\left( {A^{'} /S} \right) \hfill \\ p_{3} \left( i \right) = x_{ 3} \left( i \right) \oplus \left( {y_{ 3} \left( i \right)^{'} \cdot y_{ 2} \left( i \right)^{'} \cdot y_{ 1} \left( i \right)^{'} } \right) \cdot \left( {A^{'} /S} \right) \oplus y_{ 3} \left( i \right) \cdot \left( {A^{'} /S} \right)^{'} \hfill \\ \end{gathered} $$
(11.14)
$$ g_{k} \left( i \right) = x_{k} \left( i \right),\,\forall k. $$
(11.15)

A third alternative is computing G and P directly from the input data. As far as addition is concerned, the P and G functions may be implemented according to formulas (11.4) and (11.5). The idea is computing the corresponding functions in the subtract mode and then multiplexing according to the add/subtract control signal A′/S.

11.2.4 FPGA Implementations of Adder Subtractors

To compute X – Y, 10’s complement subtraction algorithm actually adds (−Y) to X. So for a first implementation, Fig. 11.6 presents a 9’s complement implementation using 6-input/2-output LUTs, available in the Xilinx (Virtex-5, Virtex-6, spatan6, 7-series) technology. A′/S is the add/subtract control signal; if A/S = 1 (subtract), formulas (11.13) apply, otherwise A/S = 0 and \( w_{j} \left( i \right) = y_{j} \left( i \right)\forall i,j. \)

Fig. 11.5
figure 5

FPGA carry-chain for decimal addition. a. P-G calculation using an intermediate addition. b. P-G calculation directly from BCD digits

The complete circuit is similar to the circuit of Fig. 11.4 , but instead of input y, the input w as produced by the circuit of Fig. 11.5.

Fig. 11.6
figure 6

FPGA 9’s complement circuit for BCD digit

The better alternative intended to avoid the delay produced by the 9’s complement step, embeds the 9’s complementation within the first binary adder stage, as depicted in Fig. 11.7a, where p(i) and g(i) are computed as explained in (11.14) and (11.15).

Fig. 11.7
figure 7

FPGA implementation of adder-subtractor. a Adder-subtractor for one BCD digit. b Direct computation of P-G function

The VHDL model addsub_1BDC.vhd that implements the circuit of Fig. 11.7a is available at the Authors’ web page. The VHDL model has two architectures a behavioral and a low level that instantiates components (Luts, muxcy, xorcy, etc.).

Then we can use the circuit of Fig. 11.6a to compute the P-G function using the previous computed addition of BDC digits. The VHDL model addsub_BDC_v1.vhd that implements a complete decimal adder-subtractor and it is available at the Authors’ web page. To complete the circuit, a final correction adder (correct_add.vhd) corrects the decimal digit as a function of the carries.

The third alternative is computing G and P directly from the input data. For this reason, assuming that the operation at hand is X + (±Y), one defines on one hand ppa(i) and gga(i) according to (11.4) and (11.5) (Sect. 11.1.4), i.e. using the straight values of Y’s BCD components. On the other hand, pps(i) and ggs(i) are defined using w k (i) as computed by the 9’s complement circuit (11.13). As w k (i) are expressed from the y k (i) both pps(i) and ggs(i) may be computed directly from x k (i) and y k (i). Then the correct pp(i) and gg(i) signal is selected according to the add/subtract control signal A/S. Finally, the propagate and generate function are computed as:

$$ p\left( i \right) = (x_{0} \left( i \right) \oplus y_{0} \left( i \right) \oplus (A^{\prime}/S)) \cdot pp\left( i \right), $$
(11.16)
$$ g\left( i \right) = gg\left( i \right) \vee (pp\left( i \right) \cdot x_{0} \left( i \right) \cdot (y_{0} \left( i \right) \oplus (A^{\prime}/S))). $$
(11.17)

Figure 11.7b shows the Xilinx LUT based implementation. The multiplexers are implemented using dedicated muxF7 resources.

The VHDL model addsub_BDC_v2.vhd that implements the adder-subtractor using for the P-G computation from the direct inputs (carry-chain_v2.vhd) is available at the Authors’ web page.

11.3 Decimal Multiplication

This section presents approaches for BCD multipliers. It starts by discussing algorithmic alternatives to implement a 1 × 1 BCD digit multiplier. Then, this section proposes an implementation for an N × 1 BCD multiplier. This operation can be used to carried out an N × M multiplier, but by itself, the N × 1 multiplication appears to be a primitive for other algorithms, such as logarithm or exponential functions.

11.3.1 One-Digit by One-Digit BCD Multiplication

11.3.1.1 Binary Arithmetic with Correction

The decimal product can be obtained through a binary product and a post correction stage [4, 5]. Let A and B be two BCD digits (a 3 a 2 a 1 a 0) and (b 3 b 2 b 1 b 0) respectively. The BCD coded product consists of two BCD digits D and C such that:

$$ A{\ast} B = D{\ast} 10 + {\text{C}} $$
(11.18)

AB is first computed as a 7-bit binary number P(6:0) such that

$$ A{\ast} B = P = p_{ 6} \,p_{ 5} \,p_{ 4} \,p_{ 3} \,p_{ 2} \,p_{ 1} \,p_{0} $$
(11.19)

Although a classic binary-to-BCD decoding algorithm can be used, it can be shown that the BCD code for P can be computed through binary sums of correcting terms described in Fig. 11.8. The first row in Figure shows the BCD weights. The weights of p 3, p 2, p 1 and p 0 are the same as those of the original binary number “\( p_{ 6} \,p_{ 5} \,p_{ 4} \,p_{ 3} \,p_{ 2} \,p_{ 1} \,p_{0} \)”. But weights 16, 32 and 64 of p 4, p 5, and p 6 have been respectively decomposed as (10, 4, 2), (20, 10, 2) and (40, 20, 4). Observe that “p 3 p 2 p 1 p 0” could violate the interval [0, 9], then an additional adjust could be necessary.

Fig. 11.8
figure 8

Binary to BCD arithmetic reduction

First the additions of Row 1, 2, 3, and correction of “p 3 p 2 p 1 p 0” are completed (least significant bit p 0 is not necessary in computations). Then the final correction is computed.

One defines (Arithmetic I):

  1. 1.

    the binary product \( A{\ast} B = P = p_{ 6} \,p_{ 5} \,p_{ 4} \,p_{ 3} \,p_{ 2} \,p_{ 1} \,p_{0} \)

  2. 2.

    the Boolean expression: \( adj_{ 1} = p_{ 3} \wedge (p_{ 2} \vee p_{1} ), \)

  3. 3.

    the arithmetic sum: \( dcp = p_{6}\, p_{5}\, p_{4}\, p_{3}\, p_{2}\, p_{1}\, p_{0} + 0\,p_{6}\, p_{5} \,0\,p_{4}\, p_{3} + 0\,0\,0\,0p_{6}\, p_{5}\, 0 + 0\,0\,0\,0\,adj_{i} \,\,adj_{1} \,0, \)

  4. 4.

    the Boolean expression: \( adj_{ 2} = (dcp_{ 3} \wedge (dcp_{ 2} \vee dcp_{ 1} )) \vee (p_{ 5} \wedge p_{ 4} \wedge p_{ 3} ). \)

One computes

$$ dc = dcp + 0\,0\,0\,0\,adj_{ 2} \,adj_{ 2}\; 0. $$
(11.20)

Then

$$ D = dc_{ 7} \,dc_{ 6} \,dc_{ 5} \,dc_{4} \,{\text{and}}\,C = dc_{ 3} \,dc_{ 2} \,dc_{ 1} \,dc_{0} $$

A better implementation can be achieved using the following relations (Arithmetic II):

  1. 1.

    the product \( A{\ast} B = P = p_{ 6} \,p_{ 5} \,p_{ 4} \,p_{ 3} \,p_{ 2} \,p_{ 1} \,p_{0} \)

  2. 2.

    compute:

    $$ \begin{gathered} cc = p_{ 3} \,p_{ 2} \,p_{ 1} \,p_{0} + 0\,p_{ 4} \,p_{ 4} \,0 + 0\,p_{ 6} \,p_{ 5} \,0, \hfill \\ dd = p_{ 6} \,p_{ 5} \,p_{ 4} + 0\;p_{ 6}\, p_{ 5} \hfill \\ \end{gathered} $$

(cc is 5 bits, dd has 4 bits, computed in parallel)

  1. 3.

    define:

    $$ cy1 = 1\,{\text{iff}}\,cc > 1 9,\,cy0 = 1\,{\text{ iff}}\,{ 9} < cc < 20 $$

(cy1 y cy2 are function of cc 3 cc 2 cc 1 cc 0, and can be computed in parallel)

  1. 4.

    compute:

    $$ c = cc_{ 3} \,cc_{ 2} \,cc_{ 1} \,cc_{0} \, + cy1\left( {cy1\,{\text{or}}\,cy0} \right)\,cy\,0\;0, $$
    (11.21)
    $$ d = dd_{ 3} \,dd_{ 2} \,dd_{ 1} \,dd_{0} + 0\,0\,cy_{ 1} \,cy_{0} $$

(c and d calculated in parallel)

Compared with the first approach, the second one requires smaller adders (5 and 4-bit vs. 8-bit) and the adders can operate in parallel as well. The VHDL models bcd_mul_arith1.vhd and bcd_mul_arith2.vhd that described the previous method for digit by digit multiplication are available at the Authors’ web page.

11.3.1.2 Using ROM

Actually a (100 × 8)-bit ROM can fit to store all the AB multiplications. However, as A and B are two 4-bit operands (BCD digits) the product can be mapped into a 28 × 8-bit ROM. In FPGA devices there mainly two main memory recourses: Block and distributed RAMS. For the examples in Xilinx devices two main possibilities are considered: Block RAMs BRAM’s or distributed RAM’s (LUT based implementation of RAM’s).

BRAM-based implementation: Xilinx Block RAM’s are 18-kbit (or 36-kbits) configurable and synchronous dual-port RAM (or ROM) blocks. They can be configured into different layouts. The 211 × 8-bit configuration option has been selected, though wasting some memory capacity. As BRAM is dual-port, two one-by-one digit multiplications can be implemented in a single BRAM and in a single clock cycle. However the main characteristic to cope with is that BRAM’s are synchronous, so either the address or the output should be registered.

DistributedRAM (LUT-based): 6-input LUT’s can be configured as 64 × 1-bit RAM’s. Moreover the four 6-LUTs in a slice has additional multiplexers to implement 256 × 1 bit RAMs. Then the 28 × 8-bit ROM can be implemented using 8 slices (32 LUT’s).

A trivial optimization reduces the area considering the computation of the BCD final result components c 0 and d 3 straightforward. Actually \( c_{0} = a_{0} \wedge b_{0} \) and \( d_{3} = a_{0} \wedge b_{0} \wedge a_{ 3} \wedge b_{ 3} . \) That is, c 0 is related to the parity while d 3 emphasizes that most significant bit is set for one in only one case (decimal 9 × 9 = 81). It is thus possible to reduce the required memory size to a 28 × 6-bit ROM only plus two LUTs to implement c 0 and d 3 .

Comments

  1. 1.

    BRAM-based design is fast but synchronous. It is useless for combinational implementations, but suitable for sequential and pipelined ones.

  2. 2.

    The existence of “do-not-care” conditions in the memory definition allows the synthesizer to reduce the effective memory requirement.

The VHDL model bcd_mul_bram.vhd implements the BRAM based implementation for the digit by digit multiplication. Additionally, and bcd_mul_mem1.vhd and bcd_mul_mem2.vhd provides the LUT based implementation of decimal BCD multiplication. These models with the corresponding test bench (test_mul_1by1BCB.vhd) are available at the Authors’ web page.

11.3.2 N by One BCD Digit Multiplier

A N × 1 BCD digit multiplier is readily achieved through N 1 × 1-digit multiplications followed by a BCD decimal addition. Fig. 11.9 shows how the partial products are arranged to feed the BCD N-digit adder stage. The carry-chain adder of Sect. 13.1 can be used.

Fig. 11.9
figure 9

N by one digit circuit multiplication. It includes N one by one digit multipliers and a fast carry-chain adder

The VHDL model mult_Nx1_BCD.vhd and mult_Nx1_BCD_bram.vhd that describe the N by one decimal multiplier are available at the Authors’ web page.

11.3.3 N by M Digits Multiplier

Using the previously designed N by one digit multiplier it is possible to perform the N × M digits multiplication. The best area compromise is obtained using a sequential implementation that uses an N × 1 digits multiplier and an N + 1 digit adder. Figure 11.10 show the scheme of the Least Significant Digit (LSD) first algorithm implemented. The B operand is shifted right at each clock cycle. The LSD digit of B is multiplied by the A operand and accumulated in the next cycle to shorten the critical path. After M + 1 cycles, the N + M digit result is available.

Fig. 11.10
figure 10

A sequential N × M multiplier

The VHDL models mult_BCD_seq.vhd and mult_BCD_bram_seq.vhd that describes the N by M digits decimal multiplier and the matching test bench (test_mult_BCB_seq.vhd) is available at the Authors’ web page.

Fully combinational implementations of N × M digits multipliers are possible also based on the N × 1 digit multiplication cell. After the first multiplication, an adder tree sums up the intermediate result and gives the N + M digit result. Figure 11.11 show an N × 8 digit multiplication example.

Fig. 11.11
figure 11

An example of a paralell N × M multiplier (N × 8)

The VHDL model mult_BCD_comb.vhd that describes the N by M digits decimal multiplier and the test bench (test_mult_BCB_comb.vhd) are available at the Authors’ web page.

11.4 Decimal Division

As described in Chap. 9, the basic algorithms are based on digit recurrence. Using Eqs. (9.1), (9.2) and (9.3) at each step of (9.2) q −(i + 1) and r i+1 are computed in function of r i and y in such a way that \( 10 \cdot r_{i} = q_{ - (i + 1)} y + r_{i + 1} ,\,{\text{with}}\, - y \le r_{i + 1} < y, \) that is

$$ r_{i + 1} = 10 \cdot r_{i} - q_{ - (i + 1)} y,\,{\text{with}}\, - y \le r_{i + 1} < y. $$
(11.22)

The Robertson diagram applied to radix-10 (B = 10) is depicted at Fig. 11.12. It defines the set of possible solutions: the dotted lines define the domain \( \left\{ {\left( { 10 \cdot r_{i} ,r_{i + 1} } \right)| - 10 \cdot y \le 10 \cdot r_{i} < 10 \cdot y\,{\text{and}}\, - y \le r_{i + 1} < y} \right\}, \) and the diagonals correspond to the equations \( r_{i + 1} = 10 \cdot r_{i} - ky \) with \( k \in \left\{ { - 10, - 9, \ldots , - 1,0, 1, \ldots , 9, 10} \right\}. \) If \( ky \le 10 \cdot r_{i} < \left( {k + 1} \right)y, \) there are two possible solutions for q −(i + 1), namely k and k + 1. To the first one corresponds a non-negative value of r i + 1, and to the second one a negative value.

Fig. 11.12
figure 12

Robertson diagram for radix 10

11.4.1 Non-Restoring Division Algorithm

A slightly modified version of the base-10 non-restoring algorithm (Sect. 9.2.1) is used. For that y must be a normalized n-digit natural, that is \( 10^{n - 1} \le y < 10^{n} . \) The remainders r i satisfy the condition −y ≤ r i  < y and belong to the range −10n < r i  < 10n. Define \( w_{i} = { 1}0 \cdot r_{i} , \) so that

$$ - 10^{n + 1} < w_{i} < 10^{n + 1} . $$
(11.23)

Thus, w i is an (n + 2)-digit 10’s complement number. The selection of every digit \( q_{ - (i + 1 )} \) is based on a truncated value of w i , namely \( w^{\prime} = \left\lfloor {w_{i} /{ 1}0^{\alpha } } \right\rfloor , \) for some α that will be defined later, so that \( w_{i} - 10^{\alpha } < w^{\prime} \cdot 10^{\alpha } \le w_{i} \) and

$$ w^{\prime} \cdot 10^{\alpha } \le w_{i} < w^{\prime} \cdot 10^{\alpha } + 10^{\alpha } . $$
(11.24)

According to (11.23) and (11.24), \( - 10^{n + 1} - 10^{\alpha } < w^{\prime} \cdot 10^{\alpha } < 10^{n + 1} , \) so that

$$ - 10^{n + 1- \alpha } \le w^{'} < 10^{n + 1- \alpha } . $$
(11.25)

Thus, w′ is an (n + 2 − α)-digit 10’s complement number. Assume that a set of integer-valued functions m k (y), for k in {−10, −9,…, −1, 0, 1,…, 8, 9}, satisfying

$$ k \cdot y \le m_{k} \left( y \right) \cdot 10^{\alpha } < \left( {k + 1} \right) \cdot y - 10^{\alpha } $$
(11.26)

has been defined. The interval [k·y, (k + 1)·y − 10α] must include a multiple of 10α. Thus, y must be greater than or equal to 2·10α. Taking into account that y ≥ 10n − 1, the condition is satisfied if α ≤ n − 2.

The following property is a straightforward consequence of (11.24) and (11.26):

Property 11.1 If \( m_{k} \left( y \right) \le w^{'} < m_{k + 1} \left( y \right),\,{\text{then}}\,k \cdot y \le w_{i} < \left( {k + 2} \right) \cdot y. \)

According to the Robertson diagram of Fig. 11.12, a solution \( q_{ - (i + 1 )} \) can be chosen as follows:

$$ \begin{gathered} {\text{if}}\,w^{'} < m_{ - 9} \left( y \right)\,{\text{then}}\,q_{ - (i + 1)} = - 9, \hfill \\ {\text{if}}\,w^{'} \ge m_{ 8} \left( y \right)\,{\text{then}}\,q_{ - (i + 1)} = 9, \hfill \\ {\text{if}}\,m_{k} \left( y \right) \le w^{'} < m_{k + 1} \left( y \right)\,{\text{for}}\,{\text{some}}\,k\,{\text{in }}\left\{ { - 9, - 8, \ldots , - 1,0, 1, \ldots , 7} \right\},\,{\text{then}}\,q_{ - (i + 1)} = k + 1. \hfill \\ \end{gathered} $$

Thus, this non-restoring algorithm generates a p-digit decimal quotient 0.q −1 q −2q p and a remainder r p satisfying

$$ x = \left( {0 \cdot q_{ - 1} q_{ - 2} \ldots q_{ - p} } \right)y + r_{p} \cdot 10^{ - p} ,\,{\text{with}}\, - y \cdot 10^{ - p} \le r_{p} \cdot 10^{ - p} < y \cdot 10^{ - p} , $$
(11.27)

where every q i is a signed decimal digit. It can be converted to a decimal number by computing the difference between two decimal numbers\( \begin{gathered} pos = q^{\prime}_{ - 1} \cdot 10^{ - 1} + q^{\prime}_{ - 2} \cdot 10^{ - 2} + \cdots + q^{\prime}_{ - p} \cdot 10^{ - p} \,{\text{and}} \hfill \\ neg = q^{\prime\prime}_{ - 1} \cdot 10^{ - 1} + q^{\prime\prime}_{ - 2} \cdot 10^{ - 2} + \cdots + q^{\prime\prime}_{ - p} \cdot 10^{ - p} , \hfill \\ {\text{with}}\,q^{\prime}_{ - i} = q_{i} \,{\text{if}}\,q_{i} > 0,q^{\prime}_{ - i} = 0\,{\text{if}}\,q_{i} < \,0,\,q^{\prime\prime}_{ - 1} = q_{i} \,{\text{if}}\,q_{i} < 0,q^{\prime\prime}_{ - i} = 0\,{\text{if}}\,q_{i} > 0. \hfill \\ \end{gathered} \)It remains to define a set of integer-valued functions m k (y) satisfying (11.26). In order to simplify their computation, they should only depend on the most significant digits of y. The following definition satisfies (11.26):

$$ \begin{gathered} m_{k} \left( y \right) = \left\lfloor {k \cdot y^{\prime}/ 10} \right\rfloor + bias\,{\text{where}}\,y^{\prime} = \left\lfloor {y/ 10^{\alpha - 1} } \right\rfloor \,{\text{if}}\,k \ge 0\,{\text{and}} \hfill \\ m_{k} \left( y \right) = - \left\lfloor { - k \cdot y^{\prime}/ 10} \right\rfloor + bias\,{\text{if}}\,k < 0, \hfill \\ \end{gathered} $$

where bias is any natural belonging to the range 2 ≤ bias ≤ 6. With α = n-2, y′ as a 3-digit natural, and w′ and m k (y) are 4-digit 10’s complement numbers. In the following Algorithm 1 m k (y) is computed without adding up bias to \( \left\lfloor {k \cdot y^{\prime}/ 10} \right\rfloor {\text{or}} - \left\lfloor { - k \cdot y^{\prime}/ 10} \right\rfloor , \) and w′ is substituted by w′—bias.

Algorithm 11.3: Non-restoring algorithm for decimal numbers

The structure of the data path corresponding to Algorithm 11.3 is shown in Fig. 11.13. Additionally, two decimal shift registers storing pos and neg and an output subtractor are necessary. Alternatively the conversion could be done on-the fly. The computation time and complexity of the k_ys generation and range detection components are independent of n. The execution time of the iteration step is determined by the n-digit by 1-digit multiplier and by the (n + 1)-digit adder. The total computation time is O(p·n).

Fig. 11.13
figure 13

Non-restoring radix 10 algorithm data path

The simplified VHDL model decimal_divider_nr.vhd that describes the N by N digits divider that produces P digits of decimal result and the test bench (test_dec_div_seq.vhd) are available at the Authors’ web page.

11.4.2 An SRT-Like Division Algorithm

In order to make the computation time linear, a classical method consists of using carry-save adders and multipliers instead of ripple-carry ones. In this way, the iteration step execution time can be made independent of the number n of digits. Nevertheless, this poses the following problem: a solution \( q_{ - (i + 1)} \) of Eq. (9.2) must be determined in function of a carry-stored representation (s i , c i ) of r i , without actually computing r i  = s i  + c i . An algorithm similar to the SRT-algorithms for radix-2K dividers can be defined for decimal numbers (Sect. 9.2.3). Once again the divisor y must be assumed to be a normalized n-digit natural.

All along the algorithm execution, r i will be encoded in the form \( w_{i} = s_{i} + c_{i} \) (stored-carry encoding) where s i and c i are 10’s-complement numbers. Define w i and w′ as in previous section, that is \( w_{i} = 10 \cdot r_{i} ,\,{\text{and}}\,w^{\prime} = \left\lfloor {w_{i} / 10^{\alpha } } \right\rfloor . \) Thus, \( w_{i} = 10 \cdot s_{i} + 10 \cdot c_{i} \) and \( w^{\prime} = \left\lfloor {\left( { 10 \cdot s_{i} + 10 \cdot c_{i} } \right)/ 10^{\alpha } } \right\rfloor . \) Define also truncated values s t and c t of 10·s i and 10·c i , that is \( s_{t} = \left\lfloor { 10 \cdot s_{i} / 10^{\alpha } } \right\rfloor \) and \( c_{t} = \left\lfloor { 10 \cdot c_{i} / 10^{\alpha } } \right\rfloor , \) and let w″ be the result of adding s t and c t . The difference between \( w^{\prime} = \left\lfloor {\left( { 10 \cdot s_{i} + 10 \cdot c_{i} } \right)/ 10^{\alpha } } \right\rfloor \) and \( w^{\prime\prime} = \left\lfloor { 10 \cdot s_{i} / 10^{\alpha } } \right\rfloor + \left\lfloor { 10 \cdot c_{i} / 10^{\alpha } } \right\rfloor \) is the possible carry from the rightmost positions, so that \( w^{\prime} - 1\le w^{\prime\prime} \le w^{\prime}, \) and thus

$$ w^{\prime\prime} \le w^{\prime} \le w^{\prime\prime} + { 1}. $$
(11.28)

According to (11.25) and (11.28),

$$ - 10^{n + 1- \alpha } - 1\le w^{\prime\prime} < 10^{n + 1- \alpha } , $$
(11.29)

so that w″ is an (n + 3−α)-digit 10’s-complement number. The relation between w i and the estimate w″·10α of w i is deduced from (11.24) and (11.28):

$$ w^{\prime\prime} \cdot 10^{\alpha } \le w_{i} < w^{\prime\prime} \cdot 10^{\alpha } + 2 \cdot 10^{\alpha } . $$
(11.30)

Assume that a set of integer-valued functions M k (y), for k in {−10, −9,…, −1, 0, 1,…, 8, 9}, satisfying

$$ k \cdot y \le M_{k} \left( y \right) \cdot 10^{\alpha } < \left( {k + 1} \right) \cdot y - 2\cdot 10^{\alpha ,} $$
(11.31)

have been defined. The interval [k·y, (k + 1)·y − 2·10α[ must include a multiple of 10α. Thus y must be greater than or equal to 3·10α. Taking into account that y ≥ 10n − 1, once again the condition is satisfied if α ≤ n − 2. The following property is a straightforward consequence of (11.30) and (11.31).

Property If \( M_{k} \left( y \right) \le w^{\prime\prime} < M_{k + 1} \left( y \right),\,{\text{then}}\,k \cdot y \le w_{i} < \left( {k + 2} \right) \cdot y. \)

Thus, according to the Robertson diagram of Fig. 11.12, a solution \( q_{ - (i + 1)} \) can be chosen as follows:

$$ \begin{gathered} {\text{if}}\,w^{\prime\prime} < M_{ - 9} \left( y \right)\,{\text{then}}\,q_{ - (i + 1)} = - 9, \hfill \\ {\text{if}}\,w^{\prime\prime} \ge M_{ 8} \left( y \right)\,{\text{then}}\,q_{ - (i + 1)} = 9, \hfill \\ {\text{if}}\,M_{k} \left( y \right) \le w^{\prime\prime} < M_{k + 1} \left( y \right)\,{\text{for}}\,{\text{some}}\,k\,{\text{in}}\left\{ { - 9, - 8, \ldots , - 1,0, 1, \ldots , 7} \right\},\,{\text{then}}\,q_{ - (i + 1)} = k + 1. \hfill \\ \end{gathered} $$

This SRT-like algorithm generates a p-digit decimal quotient 0.q −1 q −2q p and a remainder r p satisfying (11.27), and can be converted to a decimal number by computing the difference between two decimal numbers as in non-restoring algorithm.

It remains to define a set of integer-valued functions M k (y) satisfying (11.31). In order to simplify their computation, they should only depend on the most significant digits of y. Actually, the same definition as in Sect. 11.3.1 can be used, that is

$$ \begin{gathered} M_{k} \left( y \right) = \left\lfloor {k \cdot y^{\prime}/ 10} \right\rfloor + bias\,{\text{where}}\,y^{\prime} = \left\lfloor {y/ 10^{\alpha - 1} } \right\rfloor \,{\text{if}}\,k \ge 0\,{\text{and}} \hfill \\ M_{k} \left( y \right) = - \left\lfloor { - k \cdot y^{\prime}/ 10} \right\rfloor + bias\,{\text{if}}\,k < 0. \hfill \\ \end{gathered} $$

In this case the range of bias is 3 ≤ bias ≤ 6. With α = n − 2, y′ as a 3-digit natural, w′ and m k (y) are 4-digit 10’s complement numbers, and w″ is a 5-digit 10’s complement number. In the following Algorithm 2 M k (y) is computed without adding up bias to \( \left\lfloor {k \cdot y^{\prime}/ 10} \right\rfloor \,{\text{or}}\, - \left\lfloor { - k \cdot y^{\prime}/ 10} \right\rfloor , \) and w″ is substituted by w″—bias.

Algorithm 11.4: SRT like algorithm for decimal numbers

The structure of the data path corresponding to Algorithm 11.4 is shown in Fig. 11.4. The carry-free multiplier is a set of 1-digit by 1-digit multipliers working in parallel. Each of them generates two digits p 1, j + 1 and p 0, j such that \( q_{{ - \left( {i + 1} \right)}} \cdot y_{j} = 10 \cdot p_{ 1,j + 1} + p_{0,j} , \) and the product \( q_{{ - \left( {i + 1} \right)}} \cdot y \) is obtained under the form p 1 + p 0. The 4-to-2 counter computes 10·s i  + 10·c i  − (p 1 + p 0), that is 10·r i  − q −(i + 1)·y, under the form s i + 1 + c i + 1. Two decimal shift registers storing pos and neg and an output ripple-carry subtractor are necessary. Another output ripple-carry adder is necessary for computing the remainder r p  = s p  + c p . Thus, all the components, but the output ripple-carry components, have computation times independent of n and p. The total computation time is O(p + n).

The VHDL model decimal_divider_SRT_like.vhd that describes the N by N digits divider that produces P digits decimal result with several modules (decimal_shift_register.vhd, mult_Nx1_BCD_carrysave.vhd, special_5digit_adder.vhd, range_detection3.vhd, bcd_csa_addsub_4to2.vhd) and the test bench (test_dec_div_seq.vhd) are available at the Authors’ web page.

11.4.3 Other Methods for Decimal Division

Other methods of division could be considered such as the use digit recoding in dividend and or divisor and also use extra degree of pre-normalization for the operands. The idea behind these methods is to ease the digit selection process.

Another idea is the use of the binary digit recurrence algorithm of describe in Chap. 9 but using decimal operands.

Observe that the Algorithm 9.1 can be executed whatever the representation of the numbers. If B’s complement radix-B representation is used, then the following operations must be available: radix-B doubling, adding, subtracting and halving.

Algorithm 11.5: Binary digit-recurrence algorithm for decimal division

In order to obtain m decimal digits results, you need to choose p so that 2p ≅ 10m, that is \( p \cong m \cdot log_{ 2} 10 \cong 3. 3\cdot m. \) A possible data path is shown in Fig. 11.14. The implementation of this architecture leads to a smaller circuit, but slower than the decimal digit recurrence due to the difference in the amount of cycles to be executed.

Fig. 11.14
figure 14

Binary digit-recurrence data path for decimal division

11.5 FPGA Implementation Results

The circuits have been implemented on Xilinx Virtex-5 family with speed grade -2 [6]. The Synthesis and implementation have been carried out on XST (Xilinx Synthesis Technology) [7] and Xilinx Integrated System environment (ISE) version 13.1 [2]. The critical parts were designed using low level components instantiation (lut6_2, muxcy, xorcy, etc.) in order to obtain the desired behavior.

11.5.1 Adder-Subtractor Implementations

The adder and adder-subtractor implementation results are presented in this section. Performances of different N-digit BCD adders have been compared to those of an M-bit binary carry chain adder (implemented by XST) covering the same range, i.e. such that \( M = \left\lfloor {N.log_{ 2} \left( { 10} \right)} \right\rfloor \cong 3. 3 2 2N. \)

Table 11.1 exhibits the post placement and routing delays in ns for the decimal adder implementations Ad-I and Ad-II of Sect. 7.1; and the delays in ns for the decimal adder-subtractor implementations AS-I and AS-II of Sect. 7.2. Table 11.2 lists the consumed areas expressed in terms of 6-input look-up tables (6-input LUTs). The estimated area presented in Table 11.2 was empirically confirmed.

Table 11.1 Delays in ns for decimal and binary adders and adder-subtractor
Table 11.2 Area in 6-input LUTs for different decimal adders and adders-subtractors

Comments

  1. 1.

    Observe that for large operands, the decimal operations are faster than the binary ones.

  2. 2.

    The delay for the carry-chain adder and the adder-subtractor are similar in theory. The small difference is due to the placement and routing algorithm.

  3. 3.

    The overall area with respect to binary computation is not negligible. In Xilinx 6-input LUT family an adder-subtractor is between 3 and 4 times bigger.

11.5.2 Multiplier Implementations

The decimal multipliers use the one by one digit multiplication described in Sect. 11.2.1 and the decimal adders of Sect. 11.1. The results are for the same Virtex 5 device speed grade −2.

11.5.2.1 Decimal N × 1 Digits Implementation Results

Comparative figures of merit (minimum clock period (T) in ns, and area in LUTs and BRAMS) of the N × 1 multiplier are shown in Table 11.3, for several values of N; the result are shown for circuits using LUTs, and BRAMS to implement the digit by digit multiplication respectively.

Table 11.3 Results of BCD N × 1 multipliers using LUTs cells and BRAM cells

11.5.2.2 Sequential Implementations

The sequential circuit multiplies N digits by 1 digit per clock cycle, giving the result M + 1 cycles later. In order to speed up the computation, the addition of partial results is processed in parallel, but one cycle later (Fig. 11.10). Results for sequential implementation using LUT based cells for 1 × 1 BCD digit multiplication are given in Table 11.4. If the BRAM-based cell is used, similar periods (T) can be achieved, by means of less LUTs but using BRAMs blocks.

Table 11.4 Results of sequential implementations of N × M multipliers using one by one digit multiplication in LUTs

11.5.2.3 Combinational Implementations of N by M Multipliers

For the implementation of the N × M-digit multiplier, the N × 1 mux-based multiplication stage has been replicated M times: it is the best choice because BRAM-based multipliers are synchronous. Partial products are inputs to an addition tree. For all BCD additions the fast carry-chain adder of Sect. 11.1.4 has been used.

Input and output registers have been included in the design. Delays include FF propagation and connections. The amount of FF’s actually used is greater than 8*(M + N) because the ISE tools [2] replicate the input register in order to reduce fan-outs. The most useful data for area evaluation is the number of LUT’s (Table 11.5).

Table 11.5 Results of combinational implementations of N × M multipliers

Comments

  1. 1.

    Observe that computation time and the required area are different for N by M than M by N. That is due mainly by the use of carry logic in the adder tree.

11.5.3 Decimal Division Implementations

The algorithms have been implemented in the same Xilinx Virtex-5 device as previous circuits in the chapter. The adders and multipliers used in division are the ones described in previous sections. The non-restoring like division circuit correspond to Algorithm 11.4 (Fig. 11.13), meanwhile the SRT-like to Algorithm 11.5 (Fig. 11.15). In tables the number of decimal digits of dividend and divider is expressed as N and the number of digits of quotient as P, meanwhile the period and latency in ns (Table 11.6, 11.7).

Fig. 11.15
figure 15

SRT-like radix-10 algorithm data path

Table 11.6 Result for non-restoring division algorithm
Table 11.7 Result for SRT-like division algorithm

Comments

  1. 1.

    Both types of dividers have approximately the same costs and similar delay. They are also faster than equivalent binary dividers. As an example, the computation time of an SRT-like 48-digit divider (n = p = 48) is about 50·10.9 = 545 ns, while the computation time of an equivalent binary non-restoring divider, that is a 160-bit one (48/log102 ≅ 160), is more than 900 ns.

  2. 2.

    On the other hand, the area of the 48-digit divider (4,607 LUTs) is about five times greater than that of the 160-bit binary divider (970 LUTs). Why so a great difference? On the one hand it has been observed that decimal computation resources (adders and subtractors) need about three times more slices than binary resources (Sects. 11.4.1 and 11.4.2), mainly due to the more complex definition of the carry propagate and carry generate functions, and to the final mod 10 reduction. On the other hand, the computation of the next quotient digit is much more complex than in the binary case.

11.6 Exercises

  1. 1.

    Implement a decimal comparator. Based on the decimal adder subtractor architecture modify it to implement a comparator.

  2. 2.

    Implement a decimal “greater than” circuit that returns ‘1’ if A ≥ B else ‘0’. Tip: Base your design on the decimal adder subtractor.

  3. 3.

    Implement a N × 2 digits circuit. In order to speed up computation analyze the use of a 4 to 2 decimal reducer.

  4. 4.

    Implement a N × 4 digits circuit using a 8 to 2 decimal reducer and only one carry save adder.

  5. 5.

    Design a N by M digit multiplier using the N × 2 or the N × 4 digit multiplier. Do you improve the multiplication time? What is the area penalty with respect to the use of a N × 1 multiplier?

  6. 6.

    Implement the binary digit-recurrence algorithm for decimal division (Algorithm 11.5). The key point is an efficient implementation of radix-B doubling, adding, subtracting and halving.